Problem Statement¶
Business Context¶
Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.
Objective¶
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.
The increasing number of applicants every year calls for a Machine Learning based solution that can help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired the firm EasyVisa for data-driven solutions. You as a data scientist at EasyVisa have to analyze the data provided and, with the help of a classification model:
- Facilitate the process of visa approvals.
- Recommend suitable applicant profiles for visa certification or denial, based on the drivers that significantly influence case status.
Data Description¶
The data contains different attributes of the employee and the employer. The detailed data dictionary is given below.
- case_id: ID of each visa application
- continent: Continent of the employee
- education_of_employee: Information of education of the employee
- has_job_experience: Does the employee have any job experience? Y = Yes; N = No
- requires_job_training: Does the employee require any job training? Y = Yes; N = No
- no_of_employees: Number of employees in the employer's company
- yr_of_estab: Year in which the employer's company was established
- region_of_employment: The foreign worker's intended region of employment in the US.
- prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
- unit_of_wage: Unit of the prevailing wage. Values: Hour, Week, Month, Year.
- full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
- case_status: Flag indicating if the Visa was certified or denied
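The value domains in the dictionary above can be checked mechanically before any modeling. A minimal sketch, using two hypothetical sample rows and wage-unit strings as they appear in the data:

```python
import pandas as pd

# Two hypothetical sample rows following the data dictionary above
sample = pd.DataFrame({
    "has_job_experience": ["Y", "N"],
    "requires_job_training": ["N", "N"],
    "full_time_position": ["Y", "Y"],
    "unit_of_wage": ["Year", "Hour"],
    "case_status": ["Certified", "Denied"],
})

# Allowed value sets per the dictionary (wage units as stored in the data)
DOMAINS = {
    "has_job_experience": {"Y", "N"},
    "requires_job_training": {"Y", "N"},
    "full_time_position": {"Y", "N"},
    "unit_of_wage": {"Hour", "Week", "Month", "Year"},
    "case_status": {"Certified", "Denied"},
}

# Collect any values that fall outside their documented domain
violations = {
    col: sorted(set(sample[col]) - allowed)
    for col, allowed in DOMAINS.items()
    if not set(sample[col]) <= allowed
}
print(violations)  # an empty dict means every column respects its domain
```

An empty `violations` dict is the expected result; any entry points directly at the offending column and values.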
Installing and Importing the necessary libraries¶
# Check Python version
import sys
print(sys.version)
3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
# Upgrade build tools first
%pip install -q --upgrade pip setuptools wheel
# Version-pinned installs for reproducibility
%pip install -q \
numpy==2.0.2 \
pandas==2.2.2 \
matplotlib==3.8.4 \
seaborn==0.13.2 \
scikit-learn==1.6.1 \
sklearn-pandas==2.2.0 \
xgboost==2.0.3
Note: you may need to restart the kernel to use updated packages.
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from this point onward.
# Core data science stack
try:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
except ImportError as e:
print(f"[ERROR] Core package missing: {e.name}.")
print("Solution: Install with pip, e.g.: !pip install pandas numpy matplotlib seaborn")
# Utilities
try:
from pathlib import Path
# from google.colab import drive
from datetime import datetime
from collections import Counter
except ImportError as e:
print(f"[ERROR] Utility package missing: {e.name}.")
print("Solution: If running outside Colab, remove or replace 'google.colab' imports.")
# Version verification
try:
import pandas, numpy, matplotlib, seaborn, sklearn, xgboost
print("\n[OK] Installed package versions:")
print(f" * Python: {sys.version.split()[0]}")
print(f" * pandas: {pandas.__version__}")
print(f" * numpy: {numpy.__version__}")
print(f" * matplotlib: {matplotlib.__version__}")
print(f" * seaborn: {seaborn.__version__}")
print(f" * scikit-learn: {sklearn.__version__}")
print(f" * xgboost: {xgboost.__version__}")
except Exception as e:
print("[ERROR] Could not verify all package versions:", e)
print("Double-check that packages are installed with pip/conda.")
[OK] Installed package versions: * Python: 3.10.12 * pandas: 2.2.2 * numpy: 2.0.2 * matplotlib: 3.8.4 * seaborn: 0.13.2 * scikit-learn: 1.6.1 * xgboost: 2.0.3
Import Dataset¶
# Primary WSL path: Windows C: drive mounted at /mnt/c
p1 = Path("/mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv")
# Fallback in case the file is outside OneDrive
p2 = Path("/mnt/c/Users/caste/Desktop/MLE/AI ML/EasyVisa.csv")
data_path = p1 if p1.exists() else (p2 if p2.exists() else None)
print("[INFO] Candidates:")
print(" -", p1)
print(" -", p2)
if data_path is None:
raise FileNotFoundError(
"[ERROR] Could not find EasyVisa.csv at either location above.\n"
"Checks:\n"
" • Confirm the file is not cloud-only in OneDrive.\n"
" • Verify the path inside WSL\n"
" • If stored elsewhere, update the path here."
)
print(f"[OK] Using: {data_path}")
# Loader with encoding fallback
def read_csv_safely(path: Path, **kwargs):
try:
return pd.read_csv(path, **kwargs)
except UnicodeDecodeError:
print(f"[WARN] UnicodeDecodeError for {path.name}. Retrying with latin-1 …")
return pd.read_csv(path, encoding="latin-1", engine="python", **kwargs)
except pd.errors.ParserError as e:
print(f"[WARN] ParserError for {path.name}: {e}\nTrying engine='python' with on_bad_lines='skip'.")
return pd.read_csv(path, engine="python", on_bad_lines="skip", **kwargs)
# Load
df = read_csv_safely(data_path)
print(f"[OK] Loaded {data_path.name}")
[INFO] Candidates: - /mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv - /mnt/c/Users/caste/Desktop/MLE/AI ML/EasyVisa.csv [OK] Using: /mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv [OK] Loaded EasyVisa.csv
Overview of the Dataset¶
View the first and last 5 rows of the dataset¶
# Dataset first 5 rows
display(df.head()
.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
)
| case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV21644 | Europe | Bachelor's | N | Y | 3126 | 1800 | South | 192806.060000 | Year | Y | Certified |
| 1 | EZYV24364 | Asia | Master's | Y | N | 1649 | 1800 | Midwest | 148321.570000 | Year | Y | Certified |
| 2 | EZYV22098 | Asia | Bachelor's | Y | N | 2043 | 1800 | Northeast | 145985.430000 | Year | Y | Certified |
| 3 | EZYV12103 | Asia | High School | N | Y | 808 | 1800 | West | 127303.960000 | Year | Y | Denied |
| 4 | EZYV13156 | Asia | Master's | Y | N | 573 | 1800 | Northeast | 124457.500000 | Year | Y | Certified |
# Dataset last 5 rows
display(df.tail()
.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
)
| case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25475 | EZYV5163 | Asia | Bachelor's | Y | N | 500 | 2016 | Northeast | 12476.400000 | Year | Y | Certified |
| 25476 | EZYV15379 | Europe | Doctorate | N | N | 41 | 2016 | Northeast | 9442.230000 | Year | Y | Certified |
| 25477 | EZYV15437 | Europe | Bachelor's | Y | N | 1338 | 2016 | West | 639.957100 | Hour | Y | Denied |
| 25478 | EZYV20413 | Asia | High School | N | N | 1655 | 2016 | Northeast | 538.996600 | Hour | Y | Denied |
| 25479 | EZYV8640 | Europe | Master's | N | N | 2089 | 2016 | West | 399.629700 | Hour | Y | Denied |
Understand the shape of the dataset¶
# Print the rows and columns
print(f"\n[INFO] Dataset shape: {df.shape}")
[INFO] Dataset shape: (25480, 12)
Check the data types of the columns for the dataset¶
# Dataset info
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25480 entries, 0 to 25479 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 case_id 25480 non-null object 1 continent 25480 non-null object 2 education_of_employee 25480 non-null object 3 has_job_experience 25480 non-null object 4 requires_job_training 25480 non-null object 5 no_of_employees 25480 non-null int64 6 yr_of_estab 25480 non-null int64 7 region_of_employment 25480 non-null object 8 prevailing_wage 25480 non-null float64 9 unit_of_wage 25480 non-null object 10 full_time_position 25480 non-null object 11 case_status 25480 non-null object dtypes: float64(1), int64(2), object(9) memory usage: 2.3+ MB
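`df.info()` reports nine `object` columns and about 2.3 MB of memory. For low-cardinality string columns like these, converting to the pandas `category` dtype can cut memory substantially. A minimal sketch on a hypothetical frame (not the real `df`):

```python
import pandas as pd

# Hypothetical frame mimicking the all-object dtypes reported by df.info() above
df_demo = pd.DataFrame({
    "continent": ["Asia", "Europe", "Africa"] * 1000,
    "case_status": ["Certified", "Denied", "Certified"] * 1000,
})
before = df_demo.memory_usage(deep=True).sum()

# Low-cardinality object columns shrink considerably as pandas 'category'
df_demo = df_demo.astype({"continent": "category", "case_status": "category"})
after = df_demo.memory_usage(deep=True).sum()
print(f"memory: {before:,} -> {after:,} bytes")
```

This is optional at 25k rows, but the same pattern pays off on larger extracts of the same schema.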
Exploratory Data Analysis (EDA)¶
Setting the data up properly beforehand avoids pitfalls later in the pipeline:
- Rock-solid config & verification (RNG, target integrity, reproducibility)
- Canonical maps for messy strings (target, Y/N flags, categories)
- Schema guard that prints unknown headers, missing canonical columns, and collisions
- A single build_canonical_df(df) entrypoint returning a clean, model-ready frame
- Stratified split sanity and class-balance report
# A) CONFIG & VERIFICATION (Pro)
import re, difflib
from typing import Dict, Any
# --- Config ---
RNG = 72
np.random.seed(RNG)
TARGET_RAW = "case_status" # raw target column in EasyVisa
TARGET_LBL = "_case_label" # canonical string label: 'Certified' / 'Denied'
TARGET_BIN = "y" # numeric target: 1/0
# Minimal raw columns expected before any feature engineering
EXPECTED_COLS = {
"case_id","continent","education_of_employee","has_job_experience",
"requires_job_training","no_of_employees","yr_of_estab",
"region_of_employment","prevailing_wage","unit_of_wage",
"full_time_position","case_status"
}
# --- Utilities ---
def verify_rng(seed: int = RNG) -> None:
"""Deterministically verify RNG reproducibility."""
a = np.random.RandomState(seed).rand(5)
b = np.random.RandomState(seed).rand(5)
assert np.allclose(a, b), "RNG reproducibility failed"
print("[OK] RNG reproducibility")
def verify_target_exists(df: pd.DataFrame) -> None:
"""Ensure the raw target column is present."""
assert TARGET_RAW in df.columns, f"Missing target column `{TARGET_RAW}`"
print("[OK] Target column present")
def report_class_balance(y: pd.Series, title: str = "[INFO] class balance") -> Dict[str, Any]:
"""Print and return class balance + imbalance ratio (if binary)."""
vc = y.value_counts(dropna=False)
ratio = float(vc.max()/vc.min()) if (len(vc) == 2 and vc.min() > 0) else float("nan")
print(f"{title} →")
print(vc.to_string())
if np.isfinite(ratio):
print(f" Imbalance ratio ≈ {ratio:.2f}:1")
return {"counts": vc.to_dict(), "ratio": ratio}
def verify_expected_columns(df: pd.DataFrame) -> Dict[str, Any]:
"""Check presence of all expected raw columns; print concise status."""
missing = sorted([c for c in EXPECTED_COLS if c not in df.columns])
extra = sorted([c for c in df.columns if c not in EXPECTED_COLS])
if missing:
print(f"[WARN] Missing expected columns: {missing}")
else:
print("[OK] All expected columns present")
if extra:
print(f"[INFO] Extra columns (not required): {extra[:10]}{' …' if len(extra) > 10 else ''}")
return {"missing": missing, "extra": extra, "passed": len(missing) == 0}
def verify_no_nans(series: pd.Series, name: str) -> None:
"""Hard guard to ensure no NaNs in a required series."""
n = int(series.isna().sum())
print(f"[VERIFY] {name}: NaN count = {n}")
assert n == 0, f"Found NaNs in `{name}`"
# --- One-call dataset verifier ---
def verify_dataset(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False) -> Dict[str, Any]:
"""
End-to-end verification for the raw EasyVisa frame.
- RNG reproducibility
- Target column exists
- Expected raw columns present
- (Optional) class balance if a canonical/encoded target is available
Returns a dict report; raises if strict=True and checks fail.
"""
print(f"\n=== VERIFICATION START: {name} ===")
verify_rng()
verify_target_exists(df)
schema_rep = verify_expected_columns(df)
# If the notebook has already created canonical target columns, report them
report_lbl = {}
report_bin = {}
if TARGET_LBL in df.columns:
report_lbl = report_class_balance(df[TARGET_LBL], "[INFO] target label balance")
if TARGET_BIN in df.columns:
report_bin = report_class_balance(df[TARGET_BIN], "[INFO] target numeric balance")
passed = bool(schema_rep["passed"])
summary = {
"dataset": name,
"rows": int(len(df)),
"cols": int(df.shape[1]),
"expected_ok": passed,
"missing": schema_rep["missing"],
"extra": schema_rep["extra"],
"label_balance": report_lbl,
"numeric_balance": report_bin,
"passed": passed,
}
status = "PASS ✓" if summary["passed"] else "FAIL ✗"
print(f"[SUMMARY] rows={summary['rows']} cols={summary['cols']} | "
f"expected_ok={summary['expected_ok']} ⇒ {status}")
print(f"=== VERIFICATION END: {name} ===\n")
if strict and not summary["passed"]:
raise AssertionError(f"[{name}] Verification failed: {summary}")
return summary
report = verify_dataset(df, name="EasyVisa Raw", strict=True)
=== VERIFICATION START: EasyVisa Raw === [OK] RNG reproducibility [OK] Target column present [OK] All expected columns present [SUMMARY] rows=25480 cols=12 | expected_ok=True ⇒ PASS ✓ === VERIFICATION END: EasyVisa Raw ===
# B) CANONICALIZATION
from typing import Dict, Any, Tuple
import difflib, re
# Helpers
_norm = lambda s: re.sub(r"[^a-z0-9]", "", str(s).lower())
_CANON_MAP: Dict[str,str] = {
"caseid": "case_id",
"continent": "continent",
"educationofemployee": "education_of_employee",
"hasjobexperience": "has_job_experience",
"requiresjobtraining": "requires_job_training",
"noofemployees": "no_of_employees",
"yrofestab": "yr_of_estab",
"regionofemployment": "region_of_employment",
"prevailingwage": "prevailing_wage",
"unitofwage": "unit_of_wage",
"fulltimeposition": "full_time_position",
"casestatus": "case_status",
}
def canon_case_status(s: pd.Series) -> pd.Series:
s = s.astype("string").str.strip().str.lower()
m = {
"certified": "Certified",
"certified-expired": "Certified",
"denied": "Denied",
"rejected": "Denied",
"withdrawn": "Denied",
}
return s.map(m)
def canon_yn(s: pd.Series) -> pd.Series:
s = s.astype("string").str.strip().str.upper().replace({"YES":"Y","NO":"N"})
return s.where(s.isin(["Y","N"]), np.nan)
def canon_title(s: pd.Series) -> pd.Series:
return s.astype("string").str.strip().str.title()
def canon_education(s: pd.Series) -> pd.Series:
s = s.astype("string").str.strip().str.lower()
s = s.str.replace("’","'", regex=False).str.replace(r"\s+"," ", regex=True)
def map_one(x):
if pd.isna(x): return np.nan  # pd.NA would break the truthiness of x == "nan"
if "high" in x: return "High School"
if "bachelor" in x: return "Bachelor's"
if "master" in x: return "Master's"
if "doctor" in x or "phd" in x: return "Doctorate"
return "Other"
return s.map(map_one)
# Internal helper columns to ignore in schema warnings
INTERNAL_COLS = {
"_case_label","_edu","_continent","_region","_uow",
"_job_exp","_job_train","_full_time"
}
# Column-level canonicalization
def canonicalize_columns(df: pd.DataFrame, name: str = "frame") -> Tuple[pd.DataFrame, Dict[str, Any]]:
"""Rename raw columns using _CANON_MAP, report unknowns/missing/collisions."""
raw = list(df.columns)
keys = [_norm(c) for c in raw]
# rename with collision protection
ren, used = {}, set()
for raw_col, k in zip(raw, keys):
new = _CANON_MAP.get(k, raw_col)
if new in used and new != raw_col:
new = raw_col # avoid overwrite
ren[raw_col] = new
used.add(new)
out = df.rename(columns=ren).copy()
# unknown headers: skip internals; expected names are the canonical map's values
expected = set(_CANON_MAP.values())
unknown = [(r, k) for r, k in zip(raw, keys)
if r not in INTERNAL_COLS and not str(r).startswith("_")
and (k not in _CANON_MAP) and (r not in expected)]
# missing expected canonical columns
missing = sorted([c for c in expected if c not in out.columns])
# collisions: two raw headers normalize to the same key
seen, collisions = {}, {}
for r, k in zip(raw, keys):
if k in seen and seen[k] != r:
collisions.setdefault(k, set()).update([seen[k], r])
else:
seen[k] = r
# print audit
if unknown:
print(f"[{name}] UNKNOWN headers:", [r for r,_ in unknown])
print(f"[{name}] Suggested _CANON_MAP patches:")
for r,k in unknown:
guess = difflib.get_close_matches(k, list(_CANON_MAP.keys()), n=1, cutoff=0.6)
guess_str = guess[0] if guess else "<add-key>"
print(f' _CANON_MAP["{k}"] = "{r}"  # raw="{r}", closest known key: {guess_str}')
else:
print(f"[{name}] No unknown headers. ✓")
print(f"[{name}] MISSING expected canonical columns:", missing if missing else "None ✓")
if collisions:
print(f"[{name}] COLLISIONS:")
for k, raws in collisions.items():
print(f" norm='{k}' ← raw {sorted(list(raws))}")
else:
print(f"[{name}] No collisions. ✓")
report = {
"dataset": name,
"unknown": [r for r,_ in unknown],
"unknown_count": len(unknown),
"missing": missing,
"missing_count": len(missing),
"collisions": {k: sorted(list(v)) for k,v in collisions.items()},
"collision_count": len(collisions),
"passed": (len(unknown) == 0 and len(missing) == 0 and len(collisions) == 0)
}
status = "PASS ✓" if report["passed"] else "WARN ✗"
print(f"[{name}] COLUMN SUMMARY → unknown={report['unknown_count']} | "
f"missing={report['missing_count']} | collisions={report['collision_count']} ⇒ {status}")
return out, report
# Value-level canonicalization
def apply_canonical_values(df: pd.DataFrame, name: str = "frame") -> Tuple[pd.DataFrame, Dict[str, Any]]:
"""Create canonical helper columns, coerce numerics, and print a validation summary."""
out = df.copy()
# target labels and numeric target
out["_case_label"] = canon_case_status(out["case_status"])
out["y"] = out["_case_label"].map({"Certified":1, "Denied":0}).astype("Int64")
# Y/N flags
out["_job_exp"] = canon_yn(out["has_job_experience"])
out["_job_train"] = canon_yn(out["requires_job_training"])
out["_full_time"] = canon_yn(out["full_time_position"])
# categoricals
out["_continent"] = canon_title(out["continent"])
out["_region"] = canon_title(out["region_of_employment"])
out["_uow"] = canon_title(out["unit_of_wage"])
out["_edu"] = canon_education(out["education_of_employee"])
# numerics: coerce; raw wage kept (normalization happens later)
out["no_of_employees"] = pd.to_numeric(out["no_of_employees"], errors="coerce")
out["yr_of_estab"] = pd.to_numeric(out["yr_of_estab"], errors="coerce")
out["prevailing_wage"] = pd.to_numeric(out["prevailing_wage"], errors="coerce")
# Validation report
rep: Dict[str, Any] = {"dataset": name}
# target check
lbl_counts = out["_case_label"].value_counts(dropna=False).to_dict()
y_counts = out["y"].value_counts(dropna=False).to_dict()
rep["target_label_counts"] = lbl_counts
rep["target_numeric_counts"] = y_counts
rep["target_na"] = int(out["_case_label"].isna().sum())
# Y/N validity
def yn_invalid(col: str) -> int:
# Count values not in {Y, N}; isin() maps NA to False, so NaNs count as invalid
return int((~out[col].isin(["Y","N"])).sum())
rep["job_exp_invalid"] = yn_invalid("_job_exp")
rep["job_train_invalid"] = yn_invalid("_job_train")
rep["full_time_invalid"] = yn_invalid("_full_time")
# education distribution: flag the 'Other' share if present
edu_counts = out["_edu"].value_counts(dropna=False)
rep["edu_counts"] = edu_counts.to_dict()
rep["edu_other_pct"] = float((edu_counts.get("Other", 0) / max(1, edu_counts.sum())) * 100)
# Print concise audit
print(f"[{name}] TARGET label counts:", lbl_counts)
print(f"[{name}] TARGET numeric counts:", y_counts)
print(f"[{name}] TARGET NaNs: {rep['target_na']}")
print(f"[{name}] YN invalid → job_exp={rep['job_exp_invalid']} | job_train={rep['job_train_invalid']} | full_time={rep['full_time_invalid']}")
print(f"[{name}] Education 'Other' share: {rep['edu_other_pct']:.1f}%")
# pass criteria: no target NaNs & all Y/N values valid
rep["passed"] = (rep["target_na"] == 0 and
rep["job_exp_invalid"] == 0 and
rep["job_train_invalid"] == 0 and
rep["full_time_invalid"] == 0)
status = "PASS ✓" if rep["passed"] else "WARN ✗"
print(f"[{name}] VALUE SUMMARY ⇒ {status}")
return out, rep
# One-call convenience wrapper
def canonicalize_and_validate(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False) -> Tuple[pd.DataFrame, Dict[str, Any], Dict[str, Any]]:
"""
Column rename + value canonicalization with full validation.
Returns (DF, column_report, value_report).
"""
df1, col_rep = canonicalize_columns(df, name=f"{name}-cols")
DF, val_rep = apply_canonical_values(df1, name=f"{name}-vals")
all_pass = col_rep["passed"] and val_rep["passed"]
final_status = "PASS ✓" if all_pass else "WARN ✗"
print(f"[{name}] CANONICALIZATION SUMMARY → columns={col_rep['passed']} & values={val_rep['passed']} ⇒ {final_status}\n")
if strict and not all_pass:
raise AssertionError(f"[{name}] Canonicalization failed: col={col_rep}, val={val_rep}")
return DF, col_rep, val_rep
# Build canonical frame with verification
DF, col_report, val_report = canonicalize_and_validate(df, name="EasyVisa", strict=True)
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected canonical columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] COLUMN SUMMARY → unknown=0 | missing=0 | collisions=0 ⇒ PASS ✓
[EasyVisa-vals] TARGET label counts: {'Certified': 17018, 'Denied': 8462}
[EasyVisa-vals] TARGET numeric counts: {np.int64(1): 17018, np.int64(0): 8462}
[EasyVisa-vals] TARGET NaNs: 0
[EasyVisa-vals] YN invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Education 'Other' share: 0.0%
[EasyVisa-vals] VALUE SUMMARY ⇒ PASS ✓
[EasyVisa] CANONICALIZATION SUMMARY → columns=True & values=True ⇒ PASS ✓
# C) RENAME AND SCHEMA GUARD
from typing import Dict, Any
import difflib
# helper/internal columns created during canonicalization
INTERNAL_COLS = {
"_case_label","_edu","_continent","_region","_uow",
"_job_exp","_job_train","_full_time"
}
def rename_canonical(df: pd.DataFrame) -> pd.DataFrame:
"""Rename columns using _CANON_MAP while avoiding alias collisions."""
ren, used = {}, set()
for c in df.columns:
k = _norm(c)
new = _CANON_MAP.get(k, c)
# avoid overwriting when alias and real both exist
if new in used and new != c:
new = c
ren[c] = new
used.add(new)
return df.rename(columns=ren).copy()
def schema_guard(df: pd.DataFrame, name: str = "frame", strict: bool = False) -> Dict[str, Any]:
"""
Validate schema quality:
- unknown headers (not in alias map or expected), ignoring internal helper cols
- missing expected columns (after rename)
- collisions (two raw headers normalize to same key)
Returns a dict report; raises AssertionError if strict=True and validation fails.
"""
raw = list(df.columns)
keys = [_norm(c) for c in raw]
# Unknown headers: skip internals and underscore-prefixed helpers
unknown = [
(r, k) for r, k in zip(raw, keys)
if (r not in INTERNAL_COLS)
and (not str(r).startswith("_"))
and (k not in _CANON_MAP)
and (r not in EXPECTED_COLS)
]
if unknown:
print(f"[{name}] UNKNOWN headers:", [r for r,_ in unknown])
print(f"[{name}] Suggested _CANON_MAP patches:")
for r, k in unknown:
guess = difflib.get_close_matches(k, list(_CANON_MAP.keys()), n=1, cutoff=0.6)
guess_str = guess[0] if guess else "<add-key>"
print(f' _CANON_MAP["{k}"] = "{r}"  # raw="{r}", closest known key: {guess_str}')
else:
print(f"[{name}] No unknown headers. ✓")
# Missing expected columns after rename
missing = sorted([c for c in EXPECTED_COLS if c not in df.columns])
print(f"[{name}] MISSING expected columns:", missing if missing else "None ✓")
# Collisions: two raw headers → same normalized key
seen, collisions = {}, {}
for r, k in zip(raw, keys):
if k in seen and seen[k] != r:
collisions.setdefault(k, set()).update([seen[k], r])
else:
seen[k] = r
if collisions:
print(f"[{name}] COLLISIONS:")
for k, raws in collisions.items():
print(f" norm='{k}' ← raw {sorted(list(raws))}")
else:
print(f"[{name}] No collisions. ✓")
# Summary and machine-readable report
report = {
"dataset": name,
"unknown": [r for r,_ in unknown],
"unknown_count": len(unknown),
"missing": missing,
"missing_count": len(missing),
"collisions": {k: sorted(list(v)) for k, v in collisions.items()},
"collision_count": len(collisions),
"passed": (len(unknown) == 0 and len(missing) == 0 and len(collisions) == 0),
}
status = "PASS ✓" if report["passed"] else "FAIL ✗"
print(f"[{name}] SUMMARY → unknown={report['unknown_count']} | "
f"missing={report['missing_count']} | collisions={report['collision_count']} ⇒ {status}")
if strict and not report["passed"]:
raise AssertionError(f"[{name}] Schema validation failed: {report}")
return report
# After loading raw df
df_renamed = rename_canonical(df)
# Print audit and get structured report
rep = schema_guard(df_renamed, name="EasyVisa-cols", strict=True)
[EasyVisa-cols] No unknown headers. ✓ [EasyVisa-cols] MISSING expected columns: None ✓ [EasyVisa-cols] No collisions. ✓ [EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0 ⇒ PASS ✓
# D) BUILD CANONICAL DF
from typing import Dict, Any, Tuple
def build_canonical_df(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False
) -> Tuple[pd.DataFrame, Dict[str, Any]]:
"""
Rename → canonicalize values → coerce numerics → validate.
Returns (DF, report). If strict=True, raises on failed validation.
"""
# 1) Column rename and schema audit
out = rename_canonical(df).copy()
col_rep = schema_guard(out, f"{name}-cols", strict=False)
# 2) Target string and numeric
out[TARGET_LBL] = canon_case_status(out[TARGET_RAW])
out[TARGET_BIN] = out[TARGET_LBL].map({"Certified": 1, "Denied": 0}).astype("Int64")
# 3) Binary flags
out["_job_exp"] = canon_yn(out["has_job_experience"])
out["_job_train"] = canon_yn(out["requires_job_training"])
out["_full_time"] = canon_yn(out["full_time_position"])
# 4) Core categoricals
out["_continent"] = canon_title(out["continent"])
out["_region"] = canon_title(out["region_of_employment"])
out["_uow"] = canon_title(out["unit_of_wage"])
out["_edu"] = canon_education(out["education_of_employee"])
# 5) Numerics: coerce safely (wage normalization happens later)
out["no_of_employees"] = pd.to_numeric(out["no_of_employees"], errors="coerce")
out["yr_of_estab"] = pd.to_numeric(out["yr_of_estab"], errors="coerce")
out["prevailing_wage"] = pd.to_numeric(out["prevailing_wage"], errors="coerce")
# 6) Value level validation
rep: Dict[str, Any] = {"dataset": name}
# Target checks
rep["target_label_counts"] = out[TARGET_LBL].value_counts(dropna=False).to_dict()
rep["target_numeric_counts"] = out[TARGET_BIN].value_counts(dropna=False).to_dict()
rep["target_na"] = int(out[TARGET_LBL].isna().sum())
rep["y_na"] = int(out[TARGET_BIN].isna().sum())
# Y/N validity
def yn_invalid(col: str) -> int:
m = out[col].isin(["Y", "N"])
return int((~m | m.isna()).sum())
rep["job_exp_invalid"] = yn_invalid("_job_exp")
rep["job_train_invalid"] = yn_invalid("_job_train")
rep["full_time_invalid"] = yn_invalid("_full_time")
# Numeric NA counts (sanity only)
rep["no_of_employees_na"] = int(out["no_of_employees"].isna().sum())
rep["yr_of_estab_na"] = int(out["yr_of_estab"].isna().sum())
rep["prevailing_wage_na"] = int(out["prevailing_wage"].isna().sum())
# Pass criteria (values); tune as needed
rep["values_passed"] = (
rep["target_na"] == 0 and rep["y_na"] == 0
and rep["job_exp_invalid"] == 0
and rep["job_train_invalid"] == 0
and rep["full_time_invalid"] == 0
)
# 7) Final summary
overall_pass = col_rep["passed"] and rep["values_passed"]
rep["columns_passed"] = col_rep["passed"]
rep["passed"] = overall_pass
# Print concise summary
print(f"[{name}-vals] TARGET NaNs: labels={rep['target_na']} | y={rep['y_na']}")
print(f"[{name}-vals] Y/N invalid → job_exp={rep['job_exp_invalid']} | "
f"job_train={rep['job_train_invalid']} | full_time={rep['full_time_invalid']}")
print(f"[{name}-vals] Numeric NaNs → employees={rep['no_of_employees_na']} | "
f"yr_of_estab={rep['yr_of_estab_na']} | wage={rep['prevailing_wage_na']}")
status_cols = "PASS ✓" if col_rep["passed"] else "FAIL ✗"
status_vals = "PASS ✓" if rep["values_passed"] else "FAIL ✗"
status_all = "PASS ✓" if overall_pass else "FAIL ✗"
print(f"[{name}] CANON SUMMARY → columns={status_cols} | values={status_vals} ⇒ {status_all}\n")
if strict and not overall_pass:
raise AssertionError(f"[{name}] Canonicalization failed: columns={col_rep}, values={rep}")
return out, rep
# ID Freeze Guard
ID_COL = "case_id"
# 1) Normalize & lock dtype
DF[ID_COL] = DF[ID_COL].astype("string").str.strip()
# 2) Basic integrity
assert DF[ID_COL].notna().all(), "Null IDs detected."
dups = DF[ID_COL].value_counts()
assert (dups > 1).sum() == 0, f"Duplicate IDs found: {dups[dups>1].head(10).to_dict()}"
# 3) Deterministic ordering (optional but helpful)
DF = DF.sort_values(ID_COL).reset_index(drop=True)
# 4) Freeze a copy; never use it as a feature
ID_FREEZE = DF[ID_COL].copy() # keep alongside DF for merges / OOF / submissions
# Never include the ID or target columns in the feature matrix
DROP = {"case_id", "case_status", "_case_label", "y"}
X = DF.drop(columns=[c for c in DROP if c in DF.columns])
y = DF["y"].to_numpy()
print(f"[OK] Built X (features) and y (target) with shapes {X.shape}, {y.shape}")
[OK] Built X (features) and y (target) with shapes (25480, 17), (25480,)
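The frozen ID series exists so that model artifacts can be re-attached later without the ID ever entering `X`. A minimal sketch of that intended merge pattern, using hypothetical stand-ins for `ID_FREEZE` and a model's out-of-fold output:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the frozen IDs and a model's out-of-fold output
ID_FREEZE = pd.Series(["EZYV1", "EZYV2", "EZYV3"], name="case_id", dtype="string")
oof_preds = np.array([0.91, 0.12, 0.77])

# Attach predictions back to the frozen IDs; the ID itself is never a feature
submission = pd.DataFrame({"case_id": ID_FREEZE, "pred": oof_preds})
print(submission.to_string(index=False))
```

Because `DF` was sorted by ID and reset before the freeze, positional alignment between `ID_FREEZE` and any row-ordered prediction array stays valid.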
# E) RUN & REPORT
# 1) Core verifications on the raw frame
verify_rng()
verify_target_exists(df)
raw_schema = verify_expected_columns(df)
# 2) Build canonical dataframe and value audit
DF, canon_report = build_canonical_df(df, name="EasyVisa", strict=True)
# 3) Target class balance: labels and numeric
lbl_bal = report_class_balance(DF[TARGET_LBL], "[INFO] target label balance")
num_bal = report_class_balance(DF[TARGET_BIN], "[INFO] target numeric balance")
# 4) Quick canonical feature sanity: unique counts
key_cats = ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"]
uniques = {c: int(DF[c].dropna().nunique()) for c in key_cats}
print("\n[INFO] canonical categorical uniques:", uniques)
# 5) Compact PASS / FAIL footer
overall_pass = bool(raw_schema["passed"] and canon_report["passed"])
status = "PASS ✓" if overall_pass else "WARN ✗"
print("\n================ SUMMARY ================")
print(f"rows={len(DF):,} cols={DF.shape[1]} | "
f"raw_expected_ok={raw_schema['passed']} "
f"columns_passed={canon_report['columns_passed']} "
f"values_passed={canon_report['values_passed']} ⇒ {status}")
print("========================================\n")
# 6) Machine-readable bundle
verification_bundle = {
"raw_schema": raw_schema,
"canonicalization": canon_report,
"label_balance": lbl_bal,
"numeric_balance": num_bal,
"cat_uniques": uniques,
"passed": overall_pass,
}
assert verification_bundle["passed"], f"Verification failed: {verification_bundle}"
[OK] RNG reproducibility
[OK] Target column present
[OK] All expected columns present
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0 ⇒ PASS ✓
[EasyVisa-vals] TARGET NaNs: labels=0 | y=0
[EasyVisa-vals] Y/N invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Numeric NaNs → employees=0 | yr_of_estab=0 | wage=0
[EasyVisa] CANON SUMMARY → columns=PASS ✓ | values=PASS ✓ ⇒ PASS ✓
[INFO] target label balance →
_case_label
Certified 17018
Denied 8462
Imbalance ratio ≈ 2.01:1
[INFO] target numeric balance →
y
1 17018
0 8462
Imbalance ratio ≈ 2.01:1
[INFO] canonical categorical uniques: {'_edu': 4, '_continent': 6, '_region': 5, '_uow': 4, '_job_exp': 2, '_job_train': 2, '_full_time': 2}
================ SUMMARY ================
rows=25,480 cols=21 | raw_expected_ok=True columns_passed=True values_passed=True ⇒ PASS ✓
========================================
Let's check the statistical summary of the data¶
#Print the statistics
DF.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| case_id | 25480 | 25480 | EZYV8640 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| continent | 25480 | 6 | Asia | 16861 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| education_of_employee | 25480 | 4 | Bachelor's | 10234 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| has_job_experience | 25480 | 2 | Y | 14802 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| requires_job_training | 25480 | 2 | N | 22525 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| no_of_employees | 25480.0 | NaN | NaN | NaN | 5667.04321 | 22877.928848 | -26.0 | 1022.0 | 2109.0 | 3504.0 | 602069.0 |
| yr_of_estab | 25480.0 | NaN | NaN | NaN | 1979.409929 | 42.366929 | 1800.0 | 1976.0 | 1997.0 | 2005.0 | 2016.0 |
| region_of_employment | 25480 | 5 | Northeast | 7195 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| prevailing_wage | 25480.0 | NaN | NaN | NaN | 74455.814592 | 52815.942327 | 2.1367 | 34015.48 | 70308.21 | 107735.5125 | 319210.27 |
| unit_of_wage | 25480 | 4 | Year | 22962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| full_time_position | 25480 | 2 | Y | 22773 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| case_status | 25480 | 2 | Certified | 17018 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| _case_label | 25480 | 2 | Certified | 17018 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| y | 25480.0 | <NA> | <NA> | <NA> | 0.667896 | 0.470977 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| _job_exp | 25480 | 2 | Y | 14802 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| _job_train | 25480 | 2 | N | 22525 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| _full_time | 25480 | 2 | Y | 22773 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| _continent | 25480 | 6 | Asia | 16861 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| _region | 25480 | 5 | Northeast | 7195 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| _uow | 25480 | 4 | Year | 22962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| _edu | 25480 | 4 | Bachelor's | 10234 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# 1) Total missing values
DF.isna().sum().sort_values(ascending=False)
# 2) Split by type (wrap in print so both summaries display, not just the last expression)
print(DF.select_dtypes('number').isna().sum().sort_values(ascending=False))
print(DF.select_dtypes(exclude='number').isna().sum().sort_values(ascending=False))
# Hard guardrail
assert DF.isna().sum().sum() == 0, "Unexpected NaNs remain after cleaning."
# Numeric summary
num_cols = DF.select_dtypes('number').columns
num_desc = DF[num_cols].describe().T
num_desc['nulls'] = DF[num_cols].isna().sum().values
num_desc['null_pct'] = (DF[num_cols].isna().mean().values * 100).round(2)
# Categorical summary
cat_cols = DF.select_dtypes(exclude='number').columns
cat_desc = DF[cat_cols].describe().T
cat_desc['nulls'] = DF[cat_cols].isna().sum().values
cat_desc['null_pct'] = (DF[cat_cols].isna().mean().values * 100).round(2)
display(num_desc, cat_desc)
| count | mean | std | min | 25% | 50% | 75% | max | nulls | null_pct | |
|---|---|---|---|---|---|---|---|---|---|---|
| no_of_employees | 25480.0 | 5667.04321 | 22877.928848 | -26.0 | 1022.0 | 2109.0 | 3504.0 | 602069.0 | 0 | 0.0 |
| yr_of_estab | 25480.0 | 1979.409929 | 42.366929 | 1800.0 | 1976.0 | 1997.0 | 2005.0 | 2016.0 | 0 | 0.0 |
| prevailing_wage | 25480.0 | 74455.814592 | 52815.942327 | 2.1367 | 34015.48 | 70308.21 | 107735.5125 | 319210.27 | 0 | 0.0 |
| y | 25480.0 | 0.667896 | 0.470977 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0 | 0.0 |
| count | unique | top | freq | nulls | null_pct | |
|---|---|---|---|---|---|---|
| case_id | 25480 | 25480 | EZYV8640 | 1 | 0 | 0.0 |
| continent | 25480 | 6 | Asia | 16861 | 0 | 0.0 |
| education_of_employee | 25480 | 4 | Bachelor's | 10234 | 0 | 0.0 |
| has_job_experience | 25480 | 2 | Y | 14802 | 0 | 0.0 |
| requires_job_training | 25480 | 2 | N | 22525 | 0 | 0.0 |
| region_of_employment | 25480 | 5 | Northeast | 7195 | 0 | 0.0 |
| unit_of_wage | 25480 | 4 | Year | 22962 | 0 | 0.0 |
| full_time_position | 25480 | 2 | Y | 22773 | 0 | 0.0 |
| case_status | 25480 | 2 | Certified | 17018 | 0 | 0.0 |
| _case_label | 25480 | 2 | Certified | 17018 | 0 | 0.0 |
| _job_exp | 25480 | 2 | Y | 14802 | 0 | 0.0 |
| _job_train | 25480 | 2 | N | 22525 | 0 | 0.0 |
| _full_time | 25480 | 2 | Y | 22773 | 0 | 0.0 |
| _continent | 25480 | 6 | Asia | 16861 | 0 | 0.0 |
| _region | 25480 | 5 | Northeast | 7195 | 0 | 0.0 |
| _uow | 25480 | 4 | Year | 22962 | 0 | 0.0 |
| _edu | 25480 | 4 | Bachelor's | 10234 | 0 | 0.0 |
# Overall missingness
miss = DF.isna().sum().sort_values(ascending=False)
print(miss[miss > 0].to_string())
# % missing
pct = (DF.isna().mean()*100).sort_values(ascending=False)
print("\n% missing (top 21):\n", pct.head(21).round(2))
Series([], )

% missing (top 21):
case_id                  0.0
continent                0.0
education_of_employee    0.0
has_job_experience       0.0
requires_job_training    0.0
no_of_employees          0.0
yr_of_estab              0.0
region_of_employment     0.0
prevailing_wage          0.0
unit_of_wage             0.0
full_time_position       0.0
case_status              0.0
_case_label              0.0
y                        0.0
_job_exp                 0.0
_job_train               0.0
_full_time               0.0
_continent               0.0
_region                  0.0
_uow                     0.0
_edu                     0.0
dtype: float64
def norm_str(x):
return (str(x)
.replace("\u00a0"," ")
.strip()
.lower())
col_raw, col_mapped = "continent", "_continent" # adjust if different
# 1) What failed to map? (only meaningful if col_mapped exists)
if col_mapped in DF.columns and col_raw in DF.columns:
failed = (DF.loc[DF[col_mapped].isna(), col_raw]
.dropna().astype(str).map(norm_str).value_counts())
print(f"\nUnmapped raw values for {col_raw} -> {col_mapped}:\n", failed.head(50))
else:
print(f"\n[WARN] Columns not found for diagnostic: {col_raw} / {col_mapped}")
# 2) Compare raw uniques vs *value-map* keys (if you have one)
raw_uniques = set(DF[col_raw].dropna().astype(str).map(norm_str).unique())
# Try to locate a VALUE mapping dict. Update these names to your actual objects:
# - COL_RENAME: raw->canonical column rename dict
# - VAL_MAPS: per-column value maps, e.g., {"continent": {"na": "North America", ...}, ...}
val_map = None
try:
# preferred: a dedicated value-map container
val_map = VAL_MAPS.get(col_raw) # <-- change to your value-map structure name
except NameError:
pass
if val_map is None:
# sometimes people store all maps in one big CANON_MAP; try to pull dicts out safely
try:
maybe = CANON_MAP.get(col_raw)
val_map = maybe if isinstance(maybe, dict) else None
except NameError:
val_map = None
if isinstance(val_map, dict):
map_keys = set(norm_str(k) for k in val_map.keys())
only_in_raw = sorted(raw_uniques - map_keys)
print(f"\nRaw-only values (not in value map for {col_raw}):", only_in_raw[:50])
else:
print(f"\n[INFO] No value-mapping dict registered for {col_raw} — "
f"looks like you only have a column rename. If you expect value normalization, "
f"define VAL_MAPS['{col_raw}'] = {{...}} and re-run.")
Unmapped raw values for continent -> _continent:
Series([], Name: count, dtype: int64)
[INFO] No value-mapping dict registered for continent — looks like you only have a column rename. If you expect value normalization, define VAL_MAPS['continent'] = {...} and re-run.
def norm_str(x):
return (str(x).replace("\u00a0"," ").strip().lower())
def cat_uniques_report(df, cols):
for c in cols:
if c in df.columns:
raw = df[c].dropna().astype(str)
print(f"\n[{c}] uniques: {raw.nunique()} (raw)")
print("Top (raw):", raw.value_counts().head(10).to_dict())
norm = raw.map(norm_str)
print(f"[{c}] uniques after normalization: {norm.nunique()}")
print("Top (norm):", norm.value_counts().head(10).to_dict())
cat_cols = [
"_continent","_region","_uow","_full_time","_job_exp","_job_train","_edu"
]
cat_uniques_report(DF, cat_cols)
[_continent] uniques: 6 (raw)
Top (raw): {'Asia': 16861, 'Europe': 3732, 'North America': 3292, 'South America': 852, 'Africa': 551, 'Oceania': 192}
[_continent] uniques after normalization: 6
Top (norm): {'asia': 16861, 'europe': 3732, 'north america': 3292, 'south america': 852, 'africa': 551, 'oceania': 192}
[_region] uniques: 5 (raw)
Top (raw): {'Northeast': 7195, 'South': 7017, 'West': 6586, 'Midwest': 4307, 'Island': 375}
[_region] uniques after normalization: 5
Top (norm): {'northeast': 7195, 'south': 7017, 'west': 6586, 'midwest': 4307, 'island': 375}
[_uow] uniques: 4 (raw)
Top (raw): {'Year': 22962, 'Hour': 2157, 'Week': 272, 'Month': 89}
[_uow] uniques after normalization: 4
Top (norm): {'year': 22962, 'hour': 2157, 'week': 272, 'month': 89}
[_full_time] uniques: 2 (raw)
Top (raw): {'Y': 22773, 'N': 2707}
[_full_time] uniques after normalization: 2
Top (norm): {'y': 22773, 'n': 2707}
[_job_exp] uniques: 2 (raw)
Top (raw): {'Y': 14802, 'N': 10678}
[_job_exp] uniques after normalization: 2
Top (norm): {'y': 14802, 'n': 10678}
[_job_train] uniques: 2 (raw)
Top (raw): {'N': 22525, 'Y': 2955}
[_job_train] uniques after normalization: 2
Top (norm): {'n': 22525, 'y': 2955}
[_edu] uniques: 4 (raw)
Top (raw): {"Bachelor's": 10234, "Master's": 9634, 'High School': 3420, 'Doctorate': 2192}
[_edu] uniques after normalization: 4
Top (norm): {"bachelor's": 10234, "master's": 9634, 'high school': 3420, 'doctorate': 2192}
na_counts = DF.isna().sum().sort_values(ascending=False)
print(na_counts[na_counts > 0].to_string())
Series([], )
# Check unexpected units
uow_norm = DF["_uow"].astype(str).str.strip().str.lower()
bad_uow = uow_norm[~uow_norm.isin(["year","month","week","biweek","day","hour"])].value_counts()
print("\nUnexpected _uow values:\n", bad_uow.head(30))
Unexpected _uow values:
Series([], Name: count, dtype: int64)
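The unit audit comes back clean, but `prevailing_wage` still mixes four units (Year, Hour, Week, Month), so raw wage values are not directly comparable. A minimal annualization sketch, assuming conversion factors of 2,080 hours, 52 weeks, and 12 months per year — the factors and the helper name `annualize_wage` are illustrative assumptions, not from the dataset documentation:

```python
# Hedged sketch: convert a wage quoted per unit to an annual equivalent.
# The conversion factors (2080 h, 52 wk, 12 mo per year) are assumptions
# for illustration, not taken from the EasyVisa data dictionary.
ANNUAL_FACTORS = {"year": 1.0, "month": 12.0, "week": 52.0, "hour": 2080.0}

def annualize_wage(wage: float, unit: str) -> float:
    """Scale `wage` (quoted per `unit`) to an annual figure."""
    factor = ANNUAL_FACTORS.get(str(unit).strip().lower())
    if factor is None:
        raise ValueError(f"Unknown wage unit: {unit!r}")
    return wage * factor

# Example: $25/hour -> $52,000/year
print(annualize_wage(25.0, "Hour"))  # 52000.0
```

Applied to `_uow`, this would put all rows on a common yearly scale before any wage-based comparison or modeling.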
for c in ["case_status", "_case_label", "y"]:
assert c in DF.columns, f"Missing: {c}"
print(DF[["_case_label","y"]].isna().sum())
print(DF["_case_label"].value_counts(dropna=False))
print(DF["y"].value_counts(dropna=False))
_case_label    0
y              0
dtype: int64
_case_label
Certified    17018
Denied        8462
Name: count, dtype: int64
y
1    17018
0     8462
Name: count, dtype: Int64
# Final statistical summaries: numeric vs categorical columns
display(DF.select_dtypes(include="number").describe().T)
display(DF.select_dtypes(include=["object","category","bool"]).describe().T)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| no_of_employees | 25480.0 | 5667.04321 | 22877.928848 | -26.0 | 1022.0 | 2109.0 | 3504.0 | 602069.0 |
| yr_of_estab | 25480.0 | 1979.409929 | 42.366929 | 1800.0 | 1976.0 | 1997.0 | 2005.0 | 2016.0 |
| prevailing_wage | 25480.0 | 74455.814592 | 52815.942327 | 2.1367 | 34015.48 | 70308.21 | 107735.5125 | 319210.27 |
| y | 25480.0 | 0.667896 | 0.470977 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| count | unique | top | freq | |
|---|---|---|---|---|
| case_id | 25480 | 25480 | EZYV8640 | 1 |
| continent | 25480 | 6 | Asia | 16861 |
| education_of_employee | 25480 | 4 | Bachelor's | 10234 |
| has_job_experience | 25480 | 2 | Y | 14802 |
| requires_job_training | 25480 | 2 | N | 22525 |
| region_of_employment | 25480 | 5 | Northeast | 7195 |
| unit_of_wage | 25480 | 4 | Year | 22962 |
| full_time_position | 25480 | 2 | Y | 22773 |
| case_status | 25480 | 2 | Certified | 17018 |
| _case_label | 25480 | 2 | Certified | 17018 |
| _edu | 25480 | 4 | Bachelor's | 10234 |
# E) Build canonical frame (idempotent)
try:
_ = df # noqa: F401
except NameError:
# Fallback: load if user re-runs from here
df = read_csv_safely(data_path)
df_can, audit = build_canonical_df(df, name="EasyVisa", strict=False)
print("[OK] Canonical DF built:", df_can.shape)
print("[OK] Columns:", sorted(df_can.columns)[:12], "…")
verify_target_exists(df_can)
report_class_balance(df_can[TARGET_BIN], title="[INFO] Class balance (y)")
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0 ⇒ PASS ✓
[EasyVisa-vals] TARGET NaNs: labels=0 | y=0
[EasyVisa-vals] Y/N invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Numeric NaNs → employees=0 | yr_of_estab=0 | wage=0
[EasyVisa] CANON SUMMARY → columns=PASS ✓ | values=PASS ✓ ⇒ PASS ✓
[OK] Canonical DF built: (25480, 21)
[OK] Columns: ['_case_label', '_continent', '_edu', '_full_time', '_job_exp', '_job_train', '_region', '_uow', 'case_id', 'case_status', 'continent', 'education_of_employee'] …
[OK] Target column present
[INFO] Class balance (y) →
y
1    17018
0     8462
Imbalance ratio ≈ 2.01:1
{'counts': {np.int64(1): 17018, np.int64(0): 8462}, 'ratio': 2.011108484991728}
Fixing the negative values in the number of employees column¶
# Clean negatives in no_of_employees on the canonical dataframe (df_can)
col = "no_of_employees"
# Work on a copy to avoid chained assignment surprises; reassign to df_can at the end
_df = df_can.copy()
# Coerce to numeric (in case of stray strings); invalid -> NaN
_df[col] = pd.to_numeric(_df[col], errors="coerce")
# Count negatives (ignore NaN)
neg_mask = _df[col] < 0
neg_count = int(neg_mask.sum())
print(f"[INFO] Found {neg_count} negative values in {col}")
# Compute mean of valid (non-negative, non-null) values
valid_mask = _df[col].ge(0) & _df[col].notna()
valid_mean = _df.loc[valid_mask, col].mean()
if np.isfinite(valid_mean):
fill_value = int(round(valid_mean))
print(f"[OK] Mean of valid {col} = {valid_mean:.2f} -> using {fill_value} for replacement")
_df.loc[neg_mask, col] = fill_value
print("[DONE] Negative values replaced with mean-rounded")
else:
print(f"[WARN] No valid non-negative {col} values to compute mean. No replacements made.")
# Reassign back
df_can = _df
print(df_can[col].describe())
[INFO] Found 33 negative values in no_of_employees
[OK] Mean of valid no_of_employees = 5674.42 -> using 5674 for replacement
[DONE] Negative values replaced with mean-rounded
count     25480.000000
mean       5674.414796
std       22877.012865
min          12.000000
25%        1028.000000
50%        2114.000000
75%        3515.000000
max      602069.000000
Name: no_of_employees, dtype: float64
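The same cleaning logic can be factored into a reusable helper for other numeric columns. A minimal sketch — the function name `replace_negatives_with_mean` is mine — mirroring the steps above: coerce to numeric, average the valid non-negative values, and substitute the rounded mean for negatives:

```python
import pandas as pd

def replace_negatives_with_mean(s: pd.Series) -> pd.Series:
    """Replace negative entries with the rounded mean of the non-negative ones."""
    s = pd.to_numeric(s, errors="coerce")   # stray strings -> NaN
    valid = s[s.ge(0)]                      # non-negative, non-null values
    if valid.empty:
        return s                            # nothing to impute from
    fill = int(round(valid.mean()))
    return s.mask(s < 0, fill)              # swap negatives for the rounded mean

# Toy example: mean of the valid values [10, 30, 20] is 20
toy = pd.Series([10, -5, 30, -1, 20])
print(replace_negatives_with_mean(toy).tolist())  # [10, 20, 30, 20, 20]
```

A median-based fill would be a reasonable alternative here given the heavy right skew of `no_of_employees`; the mean is used only to match the notebook's chosen strategy.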
Let's check the count of each unique category in each of the categorical variables¶
# Unique values per column
df.nunique()
case_id                  25480
continent                    6
education_of_employee        4
has_job_experience           2
requires_job_training        2
no_of_employees           7105
yr_of_estab                199
region_of_employment         5
prevailing_wage          25454
unit_of_wage                 4
full_time_position           2
case_status                  2
dtype: int64
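Note that `case_id` has one unique value per row (25,480 uniques), so it carries no predictive signal and should be excluded before modeling. A small sketch — the helper name `drop_id_columns` is illustrative:

```python
import pandas as pd

# Pure-identifier columns (one unique value per row) add no predictive
# signal and can leak row identity; drop them before modeling.
def drop_id_columns(df: pd.DataFrame, id_cols=("case_id",)) -> pd.DataFrame:
    return df.drop(columns=[c for c in id_cols if c in df.columns])

toy = pd.DataFrame({"case_id": ["EZYV1", "EZYV2"], "wage": [100, 200]})
print(drop_id_columns(toy).columns.tolist())  # ['wage']
```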
Univariate Analysis¶
# Univariate Analysis Function (updated for canonical df_can)
def univariate_box_hist(
data: pd.DataFrame,
feature: str,
kde: bool = False,
bins: int | None = None,
figsize: tuple = (12, 6),
color: str = "#4C72B0",
log_transform: bool = False
):
# Prepare data
x = pd.to_numeric(data[feature], errors="coerce").dropna()
if log_transform:
# Ensure non-negative before log1p (robust to any residual negatives)
x = np.log1p(np.clip(x, a_min=0, a_max=None))
if x.empty:
print(f"[WARN] No numeric data available for '{feature}' after coercion/dropna.")
return None, (None, None)
# Create figure with two stacked subplots
fig, (ax_box, ax_hist) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
# ---- Boxplot ----
sns.boxplot(
x=x,
ax=ax_box,
color=color,
showmeans=True,
meanprops={"marker": "D", "markerfacecolor": "white", "markeredgecolor": "black"},
flierprops={"marker": "o", "markersize": 4, "alpha": 0.5},
)
ax_box.set_title(f"{feature} — Boxplot & Histogram", fontsize=13, weight="bold", pad=10)
ax_box.set_xlabel("")
ax_box.grid(True, axis="x", linestyle="--", alpha=0.4)
ax_box.set_yticks([])
# ---- Histogram ----
sns.histplot(
x=x,
kde=kde,
bins=bins,
ax=ax_hist,
color=color,
edgecolor="white",
alpha=0.8,
)
# Add mean & median lines
mean, median = float(np.mean(x)), float(np.median(x))
ax_hist.axvline(mean, color="red", linestyle="--", linewidth=1.2, label=f"Mean = {mean:,.1f}")
ax_hist.axvline(median, color="black", linestyle="-", linewidth=1.2, label=f"Median = {median:,.1f}")
# Beautify
ax_hist.legend()
ax_hist.grid(True, axis="y", linestyle="--", alpha=0.4)
ax_hist.set_xlabel(feature + (" (log1p)" if log_transform else ""))
ax_hist.set_ylabel("Count")
plt.tight_layout()
plt.show()
return fig, (ax_box, ax_hist)
# === Usage with canonical dataframe ===
univariate_box_hist(df_can, "no_of_employees", kde=True, bins=40)
print("\nlog1p transformed")
univariate_box_hist(df_can, "no_of_employees", kde=True, bins=40, log_transform=True)
log1p transformed
(<Figure size 1200x600 with 2 Axes>,
(<Axes: title={'center': 'no_of_employees — Boxplot & Histogram'}>,
<Axes: xlabel='no_of_employees (log1p)', ylabel='Count'>))
# Bar graphs with labels (robust percent formatting; no FuncFormatter dependency)
import matplotlib.ticker as mticker
def labeled_barplot(data: pd.DataFrame, feature: str, perc: bool = False, n: int | None = None):
# 1) Prepare counts (robust to NaNs and mixed dtypes)
s = data[feature].astype("string").fillna("NaN")
vc = s.value_counts(dropna=False)
if n is not None and n > 0 and len(vc) > n:
head = vc.iloc[:n]
tail_sum = int(vc.iloc[n:].sum())
vc = pd.concat([head, pd.Series({"Other": tail_sum})])
total = int(vc.sum())
yvals = (vc / total * 100.0) if perc else vc
# 2) Figure size scales with number of bars
k = len(vc)
fig_w = max(6.5, 0.6 * k)
fig, ax = plt.subplots(figsize=(fig_w, 5))
# 3) Plot
bars = ax.bar(vc.index.astype(str), yvals.values, edgecolor="white")
# 4) Labels on bars
for bar, raw in zip(bars, vc.values):
h = bar.get_height()
label = f"{(raw/total*100):.1f}%" if perc else f"{int(raw):,}"
ax.annotate(
label,
(bar.get_x() + bar.get_width()/2, h),
ha="center", va="bottom",
xytext=(0, 3), textcoords="offset points",
fontsize=10
)
# 5) Cosmetics
ax.set_ylabel("Percentage" if perc else "Count")
ax.set_title(f"{feature} — distribution", weight="bold")
ax.grid(True, axis="y", alpha=0.25)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
if perc:
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=100))
plt.tight_layout()
return fig, ax
# === Usage with canonical dataframe ===
labeled_barplot(df_can, "_region") # counts with labels
labeled_barplot(df_can, "_edu", perc=True, n=6) # percentages (top 6 + 'Other')
(<Figure size 650x500 with 1 Axes>,
<Axes: title={'center': '_edu — distribution'}, ylabel='Percentage'>)
Observations on region of employment¶
- Most visa instances are concentrated in the Northeast and South regions, which together account for the largest share of all cases.
- The West and Midwest regions have moderate representation, indicating a fairly even spread across the central and western states.
- The Island region contributes a very small percentage of cases, suggesting low labor demand there or limited reporting.
Observations on education of employee¶
- Most of the instances fall within the Bachelor’s and Master’s degree categories, with only a small percentage at the Doctorate level.
Observations on job experience¶
# Job experience distribution
labeled_barplot(df,
"has_job_experience",
perc=True
)
(<Figure size 650x500 with 1 Axes>,
<Axes: title={'center': 'has_job_experience — distribution'}, ylabel='Percentage'>)
- The distribution shows that approx. 58% of applicants have prior job experience while 42% do not; this Y/N flag can be one-hot encoded to 1 and 0.
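The Y/N encoding mentioned above is a one-liner with pandas; a minimal sketch (assumes the flags contain only "Y"/"N" after whitespace and case normalization — anything else surfaces as `<NA>` rather than silently defaulting):

```python
import pandas as pd

# Map binary Y/N flags to 1/0; unmapped values become <NA> so bad
# categories surface instead of being silently coerced.
def encode_yn(s: pd.Series) -> pd.Series:
    return (s.astype("string")
             .str.strip()
             .str.upper()
             .map({"Y": 1, "N": 0})
             .astype("Int64"))

demo = pd.Series(["Y", "n", " Y ", "N"])
print(encode_yn(demo).tolist())  # [1, 0, 1, 0]
```

The same helper applies unchanged to `requires_job_training` and `full_time_position`.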
Observations on case status¶
df["_case_label"] = df["case_status"].str.strip().str.title()
labeled_barplot(df,
"_case_label",
perc=True
)
(<Figure size 650x500 with 1 Axes>,
<Axes: title={'center': '_case_label — distribution'}, ylabel='Percentage'>)
- There is a roughly two-to-one class imbalance, with approx. 33% of the instances in the Denied category and approx. 67% in the Certified category.
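One common way to counter this roughly 2:1 imbalance at model-fitting time is class weighting. A plain-Python sketch of the usual "balanced" scheme, w_c = n_total / (n_classes * n_c), using the observed counts (the helper name is mine; scikit-learn's `class_weight="balanced"` applies the same formula):

```python
from collections import Counter

def balanced_class_weights(labels):
    """w_c = n_total / (n_classes * n_c) — rarer classes get larger weights."""
    counts = Counter(labels)
    n_total, n_classes = sum(counts.values()), len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}

# Using the observed counts: 17,018 Certified (1) vs 8,462 Denied (0)
y = [1] * 17018 + [0] * 8462
w = balanced_class_weights(y)
print({k: round(v, 3) for k, v in sorted(w.items())})  # {0: 1.506, 1: 0.749}
```

The minority Denied class gets roughly twice the weight of the majority class, offsetting the 2.01:1 ratio during training.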
Bivariate Analysis¶
Creating functions that will help us with further analysis.
# Bivariate Analysis
def _fd_bins(x: np.ndarray, min_bins: int = 20, max_bins: int = 120) -> int:
"""Freedman–Diaconis rule with sensible caps."""
x = x[~np.isnan(x)]
if x.size < 2:
return min_bins
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
if iqr <= 0:
return min_bins
h = 2 * iqr * (x.size ** (-1/3))
if h <= 0:
return min_bins
k = int(np.ceil((x.max() - x.min()) / h))
return max(min_bins, min(max_bins, k))
def bivariate_with_target(
data: pd.DataFrame,
predictor: str,
target: str,
*,
bins: int | str | np.ndarray | None = None,
log_x: bool = False,
colors: tuple[str, str] = ("#2E8B57", "#B22222")
):
# Prepare data
df = data[[predictor, target]].copy().dropna(subset=[predictor, target])
df[target] = df[target].astype(str).str.strip().str.title()
classes = df[target].unique()
if len(classes) != 2:
raise ValueError(f"{target!r} must be binary; found {len(classes)} classes: {sorted(classes)}")
# Prefer deterministic order Certified/Denied when present; else alphabetical
cls_set = set(classes.tolist())
if {"Certified", "Denied"}.issubset(cls_set):
pos, neg = "Certified", "Denied"
else:
pos, neg = sorted(classes)
pal = colors if len(colors) == 2 else ("#4C72B0", "#DD8452")
# Numeric vector, optional log
x = pd.to_numeric(df[predictor], errors="coerce")
df = df.loc[x.notna()].copy()
x = x.loc[df.index]
if log_x:
x = np.log1p(np.clip(x, a_min=0, a_max=None))
df[predictor] = x
xlabel = f"log1p({predictor})"
else:
df[predictor] = x
xlabel = predictor
# Decide bins if None
if bins is None:
bins = _fd_bins(df[predictor].to_numpy())
fig, axs = plt.subplots(2, 2, figsize=(12, 8), constrained_layout=True)
# KDE hist – positive class
sns.histplot(
data=df[df[target] == pos],
x=predictor,
bins=bins,
kde=True,
stat="density",
ax=axs[0, 0],
color=pal[0],
edgecolor="white",
alpha=0.85
)
axs[0, 0].set_title(f"{predictor} | {target} = {pos}", weight="bold")
axs[0, 0].set_xlabel(xlabel)
axs[0, 0].grid(axis="y", alpha=.25)
# KDE hist – negative class
sns.histplot(
data=df[df[target] == neg],
x=predictor,
bins=bins,
kde=True,
stat="density",
ax=axs[0, 1],
color=pal[1],
edgecolor="white",
alpha=0.85
)
axs[0, 1].set_title(f"{predictor} | {target} = {neg}", weight="bold")
axs[0, 1].set_xlabel(xlabel)
axs[0, 1].grid(axis="y", alpha=.25)
# Boxplot with outliers
sns.boxplot(
data=df, x=target, y=predictor,
hue=target, dodge=False, palette=pal, ax=axs[1, 0],
)
if axs[1, 0].legend_ is not None:
axs[1, 0].legend_.remove()
axs[1, 0].set_title("Boxplot w.r.t. target", weight="bold")
axs[1, 0].grid(axis="y", alpha=.25)
# Boxplot WITHOUT outliers
sns.boxplot(
data=df, x=target, y=predictor,
hue=target, dodge=False, palette=pal, showfliers=False, ax=axs[1, 1],
)
if axs[1, 1].legend_ is not None:
axs[1, 1].legend_.remove()
axs[1, 1].set_title("Boxplot (no outliers) w.r.t. target", weight="bold")
axs[1, 1].grid(axis="y", alpha=.25)
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.1)
plt.show()
return fig, axs
bivariate_with_target(
df_can, predictor="no_of_employees", target=TARGET_LBL,
log_x=True, bins=60, colors=("#007ACC", "#FF6600")
)
(<Figure size 1200x800 with 4 Axes>,
array([[<Axes: title={'center': 'no_of_employees | _case_label = Certified'}, xlabel='log1p(no_of_employees)', ylabel='Density'>,
<Axes: title={'center': 'no_of_employees | _case_label = Denied'}, xlabel='log1p(no_of_employees)', ylabel='Density'>],
[<Axes: title={'center': 'Boxplot w.r.t. target'}, xlabel='_case_label', ylabel='no_of_employees'>,
<Axes: title={'center': 'Boxplot (no outliers) w.r.t. target'}, xlabel='_case_label', ylabel='no_of_employees'>]],
dtype=object))
def stacked_barplot(data: pd.DataFrame, predictor: str, target: str):
"""
Create and print category counts, then plot a 100% stacked bar chart.
Parameters
----------
data : DataFrame
The dataframe containing predictor and target columns.
predictor : str
Independent (categorical) variable.
target : str
Target (binary/categorical) variable.
"""
# sanity check
if predictor not in data.columns or target not in data.columns:
raise KeyError(f"Columns '{predictor}' or '{target}' not found in dataframe.")
# cross-tabulations
raw_tab = pd.crosstab(data[predictor], data[target], margins=True)
print(raw_tab)
print("-" * 120)
norm_tab = pd.crosstab(data[predictor], data[target], normalize="index") * 100
# Sort by the proportion of the minority/second class for consistency
if norm_tab.shape[1] > 1:
# choose the smallest or last column for sorting
sort_col = norm_tab.columns[-1]
norm_tab = norm_tab.sort_values(by=sort_col, ascending=False)
# Plot
fig, ax = plt.subplots(figsize=(max(8, len(norm_tab)*0.7), 5))
norm_tab.plot(kind="bar", stacked=True, ax=ax, color=["#007ACC", "#FF6600"])
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.1)
# label formatting
ax.set_ylabel("Percentage")
ax.set_xlabel(predictor)
ax.set_title(f"{predictor} vs {target} — Class Composition (100%)", fontweight="bold")
ax.legend(title=target, bbox_to_anchor=(1.02, 1), loc="upper left", frameon=False)
ax.grid(axis="y", alpha=0.25)
plt.tight_layout()
plt.tick_params(axis='x', rotation=45)
plt.show()
stacked_barplot(df, "education_of_employee", "_case_label")
stacked_barplot(df, "region_of_employment", "_case_label")
stacked_barplot(df, "has_job_experience", "_case_label")
_case_label            Certified  Denied    All
education_of_employee
Bachelor's                  6367    3867  10234
Doctorate                   1912     280   2192
High School                 1164    2256   3420
Master's                    7575    2059   9634
All                        17018    8462  25480
------------------------------------------------------------------------------------------------------------------------
_case_label           Certified  Denied    All
region_of_employment
Island                      226     149    375
Midwest                    3253    1054   4307
Northeast                  4526    2669   7195
South                      4913    2104   7017
West                       4100    2486   6586
All                       17018    8462  25480
------------------------------------------------------------------------------------------------------------------------
_case_label         Certified  Denied    All
has_job_experience
N                        5994    4684  10678
Y                       11024    3778  14802
All                     17018    8462  25480
------------------------------------------------------------------------------------------------------------------------
Does higher education increase the chances of visa certification for well-paid jobs abroad?¶
# Higher education vs visa certification
# Use cleaned education + canonical target label
edu_series = (df_can["_edu"] if "_edu" in df_can.columns else df_can["education_of_employee"]).astype("string")
target_col = TARGET_LBL if "TARGET_LBL" in globals() else "_case_label"
y = (df_can[target_col].astype("string").str.strip().str.title() == "Certified").astype(int)
# Natural order
order = ["High School", "Bachelor's", "Master's", "Doctorate"]
# Compute rate, n, and 95% CI
stats = (
pd.DataFrame({"edu": edu_series, "y": y})
.dropna(subset=["edu"])
.groupby("edu")
.agg(p=("y", "mean"), n=("y", "size"))
.reindex(order)
)
se = np.sqrt(stats["p"] * (1 - stats["p"]) / stats["n"])
ci = 1.96 * se
# Drop categories with n == 0 or NaN before plotting
plot_stats = stats.copy()
plot_stats["ci"] = ci
plot_stats = plot_stats.loc[plot_stats["n"].fillna(0) > 0]
# Plot
fig, ax = plt.subplots(figsize=(7, 4))
ax.errorbar(
plot_stats["p"].values,
np.arange(len(plot_stats)),
xerr=plot_stats["ci"].values,
fmt="o", capsize=4,
color="#007ACC", ecolor="#005A99",
elinewidth=1.5, markeredgewidth=1.5
)
ax.set_yticks(np.arange(len(plot_stats)))
ax.set_yticklabels(plot_stats.index, fontsize=10)
ax.set_xlabel("Certification Rate", fontsize=11)
ax.set_title("Visa Approval Rate by Education (±95% CI)", fontsize=12, weight="bold")
ax.set_xlim(0, 1)
ax.grid(axis="x", alpha=0.25)
sns.despine()
plt.tight_layout()
plt.show()
# Build summary table: rate, n, 95% CI
_overall = y.mean()
def _wilson_ci(k, n, z=1.96):
if n == 0: return (np.nan, np.nan)
p = k / n
denom = 1 + (z**2)/n
center = (p + (z**2)/(2*n)) / denom
half = z * np.sqrt((p*(1-p) + (z**2)/(4*n)) / n) / denom
return center - half, center + half
k = (pd.DataFrame({"edu": edu_series, "y": y})
.dropna(subset=["edu"])
.groupby("edu")["y"].sum()
.reindex(order))
wil = [_wilson_ci(int(k.iloc[i]), int(stats["n"].iloc[i])) for i in range(len(stats))]
ci_low = [w[0] for w in wil]
ci_high = [w[1] for w in wil]
rate_df = (
pd.DataFrame({
"rate": stats["p"],
"n": stats["n"].astype("Int64"),
"ci_low": ci_low,
"ci_high": ci_high,
"rate_pct": (100*stats["p"]).round(1),
"lift_pp": ((stats["p"] - _overall) * 100).round(1),
})
.dropna(subset=["n"])
.round(4)
)
rate_df.index.name = "education"
rate_df
| rate | n | ci_low | ci_high | rate_pct | lift_pp | |
|---|---|---|---|---|---|---|
| education | ||||||
| High School | 0.3404 | 3420 | 0.3247 | 0.3564 | 34.0 | -32.8 |
| Bachelor's | 0.6221 | 10234 | 0.6127 | 0.6315 | 62.2 | -4.6 |
| Master's | 0.7863 | 9634 | 0.7780 | 0.7943 | 78.6 | 11.8 |
| Doctorate | 0.8723 | 2192 | 0.8576 | 0.8856 | 87.2 | 20.4 |
Yes. Combined with the previously observed baseline of 66.8%, the findings show:

- High School has the lowest certification rate at ~34% (-32.8pp vs baseline)
- Bachelor's sits at ~62% (-4.6pp)
- Master's delivers the first lift above baseline at ~79% (+11.8pp)
- Doctorate is the strongest at ~87% (+20.4pp)

This indicates markedly stronger visa certification with each step up in education: relative to High School, the odds of certification are roughly 3x for Bachelor's, 7x for Master's, and 13x for Doctorate.
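As a sanity check, the 3x/7x/13x multipliers are odds ratios relative to High School, and the crosstab counts printed earlier reproduce them directly:

```python
# Certified / Denied counts per education level, from the crosstab above.
counts = {
    "High School": (1164, 2256),
    "Bachelor's": (6367, 3867),
    "Master's": (7575, 2059),
    "Doctorate": (1912, 280),
}

def odds(certified, denied):
    """Odds of certification = P(certified) / P(denied)."""
    return certified / denied

# Odds ratio of each level vs the High School baseline
base = odds(*counts["High School"])
ratios = {edu: round(odds(c, d) / base, 1) for edu, (c, d) in counts.items()}
print(ratios)
# {'High School': 1.0, "Bachelor's": 3.2, "Master's": 7.1, 'Doctorate': 13.2}
```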
How does visa status vary across different continents?¶
# Pretty categorical bivariate for DF + TARGET_LBL
import matplotlib.ticker as mticker
def bivariate_cat_pretty(
data: pd.DataFrame,
predictor: str,
target: str,
*,
colors=("#2B83F6", "#FF8C42"),
order_by="rate",
ascending=True,
return_stats=False,
):
# Checks
assert predictor in data.columns, f"Missing predictor: {predictor}"
assert target in data.columns, f"Missing target: {target}"
df = data[[predictor, target]].dropna()
y = df[target].astype(str).str.strip().str.title()
classes = y.unique()
assert len(classes) == 2, f"Target must be binary; got {sorted(map(str, classes))}"
pos, neg = ("Certified", "Denied") if {"Certified","Denied"}.issubset(set(classes)) else tuple(sorted(classes))
x = df[predictor].astype("string").str.strip().replace("", "NaN")
ctab = pd.crosstab(x, y)
for c in (pos, neg):
if c not in ctab.columns: ctab[c] = 0
ctab = ctab[[pos, neg]]
n = ctab.sum(axis=1)
p = (ctab[pos] / n).fillna(0.0)
se = np.sqrt(p * (1 - p) / n.replace(0, np.nan))
ci = (1.96 * se).fillna(0.0)
# ordering
if isinstance(order_by, (list, tuple, pd.Index)):
order = list(order_by)
elif order_by == "alpha":
order = sorted(ctab.index.astype(str))
elif order_by == "count":
order = n.sort_values(ascending=ascending).index.tolist()
else: # "rate"
order = p.sort_values(ascending=ascending).index.tolist()
ctab, n, p, ci = ctab.reindex(order), n.reindex(order), p.reindex(order), ci.reindex(order)
perc = (ctab.div(n.replace(0, np.nan), axis=0) * 100).fillna(0)
# Plot
pretty_pred = predictor.lstrip("_").replace("_", " ").title()
fig_w = max(8, 0.9 * len(order))
fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(fig_w, 7), gridspec_kw={"height_ratios":[3,2]}, constrained_layout=True)
sns.set_style("whitegrid"); sns.set_context("notebook", font_scale=1.05)
# Top: 100% stacked with n labels
bottom = np.zeros(len(order))
for cls, col in [(pos, colors[0]), (neg, colors[1])]:
vals = perc[cls].to_numpy()
ax0.bar(perc.index.astype(str), vals, bottom=bottom, label=cls, color=col, edgecolor="white", alpha=.9)
bottom += vals
ax0.set_ylim(0, 102)
for i, n_i in enumerate(n.to_numpy()):
ax0.annotate(f"n={int(n_i)}", (i, 101), ha="center", va="bottom", fontsize=9, clip_on=False)
ax0.set_ylabel("Class mix (%)")
ax0.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=100))
ax0.set_title(f"{pretty_pred} — class composition", weight="bold", pad=18)
ax0.legend(title=target, bbox_to_anchor=(1.02, 1), loc="upper left")
ax0.tick_params(axis="x", rotation=45)
ax0.grid(axis="y", alpha=.25)
# Bottom: approval rate ±95% CI sorted
y_pos = np.arange(len(p))
ax1.errorbar(p.values, y_pos, xerr=ci.values, fmt="o", ms=6,
mfc="white", mec="#1f77b4", ecolor="#1f77b4", capsize=4, lw=1.5)
ax1.set_yticks(y_pos); ax1.set_yticklabels(p.index.astype(str))
ax1.set_xlim(0, 1); ax1.set_xlabel("P(Certified)")
ax1.set_title(f"Approval rate by {pretty_pred} (±95% CI)", weight="bold")
ax1.grid(axis="x", alpha=.25)
sns.despine()
if return_stats:
stats = pd.DataFrame({"n": n.astype(int), "p_certified": p, "ci_low": (p-ci).clip(0,1), "ci_high": (p+ci).clip(0,1)})
return fig, (ax0, ax1), stats
return fig, (ax0, ax1)
# CONTINENT ONLY
fig, (ax0, ax1) = bivariate_cat_pretty(
DF, predictor="_continent", target=TARGET_LBL,
colors=("#2B83F6", "#FF8C42"),
order_by="rate", ascending=True
)
plt.show()
# Baseline overall certification rate
_overall = (DF[TARGET_LBL].astype(str).str.strip().str.title() == "Certified").mean()
# Wilson 95% CI for a proportion
def _wilson_ci(k, n, z=1.96):
if n == 0:
return (np.nan, np.nan)
p = k / n
denom = 1 + (z**2)/n
center = (p + (z**2)/(2*n)) / denom
half = z * np.sqrt((p*(1-p) + (z**2)/(4*n)) / n) / denom
return center - half, center + half
# Continent table counts, rate, 95% CI, lift vs overall
_tmp = DF[["continent", TARGET_LBL]].dropna()
_y = (_tmp[TARGET_LBL].astype(str).str.strip().str.title() == "Certified").astype(int)
_g = (
pd.concat([_tmp[["continent"]].reset_index(drop=True), _y.rename("y")], axis=1)
.groupby("continent", dropna=False)["y"]
.agg(n="count", certified="sum", rate="mean")
.reset_index()
)
_cis = _g.apply(lambda r: _wilson_ci(r["certified"], r["n"]), axis=1)
_g["ci_low"] = [c[0] for c in _cis]
_g["ci_high"] = [c[1] for c in _cis]
_g["rate_pct"] = (100 * _g["rate"]).round(1)
_g["lift_pp"] = ((_g["rate"] - _overall) * 100).round(1)
continent_tbl = _g[["continent","n","rate","rate_pct","ci_low","ci_high","lift_pp"]] \
.sort_values("rate", ascending=True)
display(continent_tbl)
continent_tbl.to_csv("continent_approval_table.csv", index=False)
| | continent | n | rate | rate_pct | ci_low | ci_high | lift_pp |
|---|---|---|---|---|---|---|---|
| 5 | South America | 852 | 0.578638 | 57.9 | 0.545202 | 0.611369 | -8.9 |
| 3 | North America | 3292 | 0.618773 | 61.9 | 0.602052 | 0.635217 | -4.9 |
| 4 | Oceania | 192 | 0.635417 | 63.5 | 0.565297 | 0.700224 | -3.2 |
| 1 | Asia | 16861 | 0.653105 | 65.3 | 0.645886 | 0.660254 | -1.5 |
| 0 | Africa | 551 | 0.720508 | 72.1 | 0.681610 | 0.756353 | 5.3 |
| 2 | Europe | 3732 | 0.792337 | 79.2 | 0.779025 | 0.805047 | 12.4 |
Yes. Against the previously observed baseline of 66.8%, the findings show:

- Europe has the highest certification rate at ~79.2% (+12.4pp vs baseline)
- Africa follows at ~72.1% (+5.3pp)
- Asia sits at ~65.3% (−1.5pp)
- Oceania at ~63.5% (−3.2pp)
- North America at ~61.9% (−4.9pp)
- South America is lowest at ~57.9% (−8.9pp)

This indicates clear continent-level differences in certification, with Europe showing the strongest positive lift and South America the weakest.
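The lift figures above can be backed by a formal independence test. A minimal, self-contained sketch; the certified counts are reconstructed from the table above as round(n × rate), so they are approximations, not raw data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Certified/denied counts per continent, reconstructed from the table above
# (order: South America, North America, Oceania, Asia, Africa, Europe).
n = np.array([852, 3292, 192, 16861, 551, 3732])
certified = np.array([493, 2037, 122, 11012, 397, 2957])  # ≈ round(n * rate)
denied = n - certified

table = np.column_stack([certified, denied])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3g}")
```

A p-value this small says the continent/status association is not sampling noise, though it says nothing about why the rates differ.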
Does having prior work experience influence the chances of visa certification for career opportunities abroad?¶
# Approval rate by job experience
assert TARGET_LBL in DF.columns, f"Missing target: {TARGET_LBL}"
exp_col = "_job_exp" if "_job_exp" in DF.columns else "has_job_experience"
assert exp_col in DF.columns, f"Missing experience column: {exp_col}"
# Map Y/N -> Yes/No keep Unknown, and set a stable categorical order
exp = (DF[exp_col].astype("string").str.strip().str.upper()
.map({"Y": "Yes", "N": "No"})
.fillna("Unknown")
.astype("category")
.cat.set_categories(["Yes", "No", "Unknown"], ordered=True))
# Binary target
y = (DF[TARGET_LBL].astype("string").str.strip().str.title().eq("Certified")).astype("int8")
# Group with explicit observed=True and get rate + n
grp = (pd.DataFrame({"exp": exp, "y": y})
.groupby("exp", observed=True)
.agg(rate=("y", "mean"), n=("y", "size")))
# 95% CI avoid divide-by-zero
se = np.sqrt(grp["rate"] * (1 - grp["rate"]) / grp["n"].replace(0, np.nan))
ci = 1.96 * se
grp["ci"] = ci
# Plot and drop empty categories to avoid blank bars
plot_df = grp.loc[grp["n"] > 0]
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(plot_df.index.astype(str), plot_df["rate"].values, edgecolor="white")
ax.errorbar(plot_df.index.astype(str), plot_df["rate"].values,
yerr=plot_df["ci"].values, fmt="none", capsize=3, color="k")
ax.set_ylim(0, 1)
ax.set_ylabel("Certification rate")
ax.set_title("Approval rate by job experience (±95% CI)")
ax.grid(axis="y", alpha=.25)
plt.tight_layout()
plt.show()
_overall = y.mean()
def _wilson_ci(k, n, z=1.96):
if n == 0:
return (np.nan, np.nan)
p = k / n
denom = 1 + (z**2)/n
center = (p + (z**2)/(2*n)) / denom
half = z * np.sqrt((p*(1-p) + (z**2)/(4*n)) / n) / denom
return center - half, center + half
order = ["Yes", "No", "Unknown"]
n_series = grp["n"].reindex(order).astype("Int64")
rate_series = grp["rate"].reindex(order)
k_series = (
pd.DataFrame({"exp": exp, "y": y})
.groupby("exp", observed=True)["y"]
.sum()
.reindex(order)
.fillna(0)
.astype(int)
)
cis = [ _wilson_ci(int(k_series.iloc[i]), int(n_series.iloc[i] if pd.notna(n_series.iloc[i]) else 0))
for i in range(len(order)) ]
ci_low = [c[0] for c in cis]
ci_high = [c[1] for c in cis]
exp_tbl = (
pd.DataFrame({
"rate": rate_series,
"n": n_series,
"ci_low": ci_low,
"ci_high": ci_high,
"rate_pct": (100*rate_series).round(1),
"lift_pp": ((rate_series - _overall)*100).round(1),
})
.dropna(subset=["n"])
.round(4)
)
exp_tbl.index.name = "job_experience"
exp_tbl
| job_experience | rate | n | ci_low | ci_high | rate_pct | lift_pp |
|---|---|---|---|---|---|---|
| Yes | 0.7448 | 14802 | 0.7377 | 0.7517 | 74.5 | 7.7 |
| No | 0.5613 | 10678 | 0.5519 | 0.5707 | 56.1 | -10.7 |
Yes. Against the previously observed baseline of 66.8%, the findings show:

- Yes (prior experience): certification rate ~74.5% (+7.7pp vs baseline)
- No (no prior experience): ~56.1% (−10.7pp)

This indicates a strong positive association between prior work experience and certification likelihood.
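The Yes/No gap can also be checked with a two-proportion z-test. A minimal sketch using approximate certified counts reconstructed from the table above (k ≈ round(n × rate)):

```python
import math

# Approximate certified counts from the table above (k ≈ round(n * rate)).
n_yes, k_yes = 14802, 11024  # has prior job experience
n_no,  k_no  = 10678, 5994   # no prior job experience

p_yes, p_no = k_yes / n_yes, k_no / n_no
p_pool = (k_yes + k_no) / (n_yes + n_no)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_yes + 1 / n_no))
z = (p_yes - p_no) / se
print(f"diff={p_yes - p_no:.3f}, z={z:.1f}")
```

With a z-score far beyond any conventional cutoff, the ~18pp gap is decisively non-random at this sample size.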
Is the prevailing wage consistent across all regions of the US?¶
# Prevailing wage by region
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
import scipy.stats as st
# Columns
region_col = "_region" if "_region" in DF.columns else "region_of_employment"
assert region_col in DF.columns, f"Missing region column: {region_col}"
if "wage_usd_annual" in DF.columns:
tmp = DF[[region_col, "wage_usd_annual"]].copy()
else:
# Compute annual wage from prevailing_wage + unit of wage
assert "prevailing_wage" in DF.columns, "Need 'prevailing_wage' to compute annual wage."
uow_col = "_uow" if "_uow" in DF.columns else ("unit_of_wage" if "unit_of_wage" in DF.columns else None)
assert uow_col is not None, "Need wage unit column '_uow' or 'unit_of_wage'."
_factor = {
"Hour": 2080.0, "Week": 52.0, "Month": 12.0, "Year": 1.0, "Yr": 1.0,
"Annual": 1.0, "Per Year": 1.0, "Bi-Week": 26.0, "Biweekly": 26.0, "Day": 260.0,
}
tmp = DF[[region_col, "prevailing_wage", uow_col]].copy()
tmp["wage_usd_annual"] = (
pd.to_numeric(tmp["prevailing_wage"], errors="coerce")
* tmp[uow_col].astype("string").str.strip().str.title().map(_factor)
)
# Tidy & log transform
tmp = tmp.dropna(subset=[region_col, "wage_usd_annual"])
tmp[region_col] = tmp[region_col].astype("string").str.strip()
tmp["log_wage"] = np.log1p(tmp["wage_usd_annual"].clip(lower=0))
# Order regions by median wage for a neat plot
order = (tmp.groupby(region_col, observed=True)["wage_usd_annual"]
.median().sort_values().index.tolist())
tmp[region_col] = pd.Categorical(tmp[region_col], categories=order, ordered=True)
# Boxplot
plt.figure(figsize=(8, 4.5))
sns.boxplot(data=tmp, x=region_col, y="log_wage")
plt.title("Yearly wage (log1p) by region")
plt.grid(axis="y", alpha=.25)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
# Kruskal–Wallis across regions
groups = [g["log_wage"].to_numpy() for _, g in tmp.groupby(region_col, observed=True)]
H, p = st.kruskal(*groups, nan_policy="omit")
k, n = len(groups), int(tmp.shape[0])
eta2 = (H - k + 1) / max(n - k, 1)  # eta-squared (η²_H) effect size for Kruskal–Wallis
print(f"Kruskal–Wallis H={H:.2f}, p={p:.3g} | eta^2≈{eta2:.3f} (k={k}, n={n})")
# Robust table: prevailing wage by region (median, 95% CI, lift vs overall median)
dfw = tmp[[region_col, "wage_usd_annual"]].dropna().copy()
dfw["wage_usd_annual"] = pd.to_numeric(dfw["wage_usd_annual"], errors="coerce")
dfw = dfw.dropna()
_overall_med = float(dfw["wage_usd_annual"].median())
def _median_ci(x, z=1.96):
x = np.asarray(x, dtype=float)
x = x[~np.isnan(x)]
n = x.size
if n == 0:
return (np.nan, np.nan, np.nan)
xs = np.sort(x)
med = float(np.median(xs))
k = int(np.floor(0.5*n - z*np.sqrt(n)/2.0))
m = int(np.ceil (0.5*n + z*np.sqrt(n)/2.0)) - 1
k = max(k, 0); m = min(m, n-1)
return med, float(xs[k]), float(xs[m])
rows = []
for r, g in dfw.groupby(region_col, observed=True):
med, lo, hi = _median_ci(g["wage_usd_annual"].values)
rows.append((str(r), int(len(g)), med, lo, hi, med - _overall_med))
region_wage_tbl = (
pd.DataFrame(rows, columns=[region_col, "n", "median_usd", "ci_low_usd", "ci_high_usd", "lift_usd"])
.sort_values("median_usd", ascending=False)
.reset_index(drop=True)
.assign(
median_usd=lambda d: d["median_usd"].round(2),
ci_low_usd=lambda d: d["ci_low_usd"].round(2),
ci_high_usd=lambda d: d["ci_high_usd"].round(2),
lift_usd=lambda d: d["lift_usd"].round(2),
)
)
region_wage_tbl
# Sensitivity 1 — Year-only rows (no annualization)
reg = "_region" if "_region" in DF.columns else "region_of_employment"
uow = "_uow" if "_uow" in DF.columns else "unit_of_wage"
year_only = DF[(DF[uow].astype(str).str.title()=="Year") & DF["prevailing_wage"].notna()]
tab_year = (year_only[[reg,"prevailing_wage"]]
.assign(wage_usd_annual=pd.to_numeric(year_only["prevailing_wage"], errors="coerce"))
.dropna()
.groupby(reg)["wage_usd_annual"]
.agg(n="size", median_usd="median")
.sort_values("median_usd", ascending=False))
print("\nYear-only rows (no annualization)")
display(tab_year)
# Sensitivity 2 — Winsorize annual wages (0.5% tails) before medians
w = dfw["wage_usd_annual"].values
lo, hi = np.percentile(w[~np.isnan(w)], [0.5, 99.5])
dfw_wins = dfw.assign(w_wins=np.clip(dfw["wage_usd_annual"], lo, hi))
tab_wins = (dfw_wins.groupby(reg, observed=True)["w_wins"]
.agg(n="size", median_usd="median")
.sort_values("median_usd", ascending=False))
print("\nWinsorized wages (0.5% tails)")
display(tab_wins)
# Sensitivity 3 — Bootstrap CI and difference: Island vs Midwest medians
rng = np.random.default_rng(72)
def boot_med(x, B=5000):
x = np.asarray(x, float); x = x[~np.isnan(x)]
n = x.size; idx = rng.integers(0, n, size=(B, n))
return np.median(x[idx], axis=1)
m_island = dfw.loc[dfw[reg]=="Island", "wage_usd_annual"].values
m_midwest = dfw.loc[dfw[reg]=="Midwest", "wage_usd_annual"].values
bI = boot_med(m_island); bM = boot_med(m_midwest)
diff = bI - bM
ci = np.quantile(diff, [0.025, 0.975])
# Report the CI only; this percentile bootstrap does not yield a formal p-value
print(f"Island–Midwest median diff (bootstrap): {np.median(m_island)-np.median(m_midwest):.0f} "
f"CI[{ci[0]:.0f}, {ci[1]:.0f}]")
# Sensitivity 4 — Control for mix: median wage by region within education
edu = "_edu" if "_edu" in DF.columns else "education_of_employee"
dfw2 = DF[[reg, edu, "prevailing_wage", uow]].dropna()
factor = {"Hour":2080,"Week":52,"Month":12,"Year":1}
dfw2["wage_usd_annual"] = (pd.to_numeric(dfw2["prevailing_wage"], errors="coerce") *
dfw2[uow].astype(str).str.title().map(factor))
mix_tab = (dfw2.groupby([reg, edu], observed=True)["wage_usd_annual"]
.median().unstack(edu).round(0))
display(mix_tab)
# Sanity scan — Island distribution, extremes, duplicates
island = dfw.loc[dfw[reg]=="Island", "wage_usd_annual"].sort_values()
print(island.describe(percentiles=[.01,.05,.5,.95,.99]).round(2))
print("Top 5:", island.tail(5).round(0).tolist())
print("Bottom 5:", island.head(5).round(0).tolist())
dup_rate = (island.value_counts().max() / max(1, island.size))
print(f"Duplicate-value concentration: {dup_rate:.3f}")
_overall_med = float(dfw["wage_usd_annual"].median())
print(f"Baseline median (all regions): ${_overall_med:,.0f}")
Kruskal–Wallis H=268.19, p=7.81e-57 | eta^2≈0.010 (k=5, n=25480)

Year-only rows (no annualization)
| _region | n | median_usd |
|---|---|---|
| Island | 354 | 93528.190 |
| Midwest | 4116 | 91357.070 |
| South | 6225 | 76946.330 |
| Northeast | 6209 | 72129.750 |
| West | 6058 | 68485.425 |
Winsorized wages (0.5% tails)
| _region | n | median_usd |
|---|---|---|
| Island | 375 | 96317.45 |
| Midwest | 4307 | 94520.54 |
| South | 7017 | 84812.35 |
| Northeast | 7195 | 81428.85 |
| West | 6586 | 73867.56 |
Island–Midwest median diff (bootstrap): 1797 CI[-4287, 7066]
| _region | Bachelor's | Doctorate | High School | Master's |
|---|---|---|---|---|
| Island | 99102.0 | 83226.0 | 98736.0 | 96988.0 |
| Midwest | 92423.0 | 80998.0 | 95779.0 | 97075.0 |
| Northeast | 83410.0 | 60821.0 | 84834.0 | 83254.0 |
| South | 85733.0 | 67438.0 | 84296.0 | 87380.0 |
| West | 78231.0 | 53483.0 | 73431.0 | 73509.0 |
count        375.00
mean      170732.10
std       657212.72
min         1985.30
1%          7078.82
5%         24117.07
50%        96317.45
95%       261664.73
99%      1437960.85
max     10074525.76
Name: wage_usd_annual, dtype: float64

Top 5: [1413260.0, 1508265.0, 2420985.0, 7199995.0, 10074526.0]
Bottom 5: [1985.0, 4508.0, 4890.0, 5274.0, 7713.0]
Duplicate-value concentration: 0.003
Baseline median (all regions): $82,839
Yes. Against the previously observed baseline median of $82,839, the findings show:

- Island has the highest median wage at ~$96,317 (+$13,478)
- Midwest follows at ~$94,521 (+$11,681)
- South at ~$84,812 (+$1,973)
- Northeast at ~$81,429 (−$1,411)
- West is lowest at ~$73,868 (−$8,972)

The Island–Midwest gap is not decisive: the bootstrap CI for the Island − Midwest median difference spans zero, so that gap should be treated as suggestive rather than conclusive. The pattern survives the Year-only rows (no unit conversions), persists within education levels, and rests on medians that are robust to wage outliers. Notably, despite higher wages, Island has the lowest approval rate (~60%), making wage and approval separate signals here.
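For reference, the `_median_ci` helper above is the standard distribution-free (order-statistic) confidence interval for a median. A self-contained sketch on synthetic, wage-like right-skewed data (the lognormal parameters are illustrative only, not fitted to this dataset):

```python
import numpy as np

def median_ci(x, z=1.96):
    """95% CI for the median via order statistics (same construction as _median_ci)."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    k = max(int(np.floor(0.5 * n - z * np.sqrt(n) / 2.0)), 0)
    m = min(int(np.ceil(0.5 * n + z * np.sqrt(n) / 2.0)) - 1, n - 1)
    return float(np.median(xs)), float(xs[k]), float(xs[m])

rng = np.random.default_rng(72)
sample = rng.lognormal(mean=11.3, sigma=0.5, size=2000)  # right-skewed, wage-like
med, lo, hi = median_ci(sample)
print(f"median={med:,.0f}  95% CI=[{lo:,.0f}, {hi:,.0f}]")
```

Because it uses ranks rather than a normal approximation of the mean, the interval stays honest under heavy wage tails like those seen in the Island sanity scan.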
Does visa status vary with changes in the prevailing wage set to protect both local talent and foreign workers?¶
# Prevailing wage → approval (quartiles) visual + robust table
# ensure annualized wage
uow_col = "_uow" if "_uow" in DF.columns else ("unit_of_wage" if "unit_of_wage" in DF.columns else None)
if "_wage_yearly" in DF.columns:
wage = pd.to_numeric(DF["_wage_yearly"], errors="coerce")
else:
assert "prevailing_wage" in DF.columns and uow_col is not None, "Need wage + unit columns"
_factor = {"Hour":2080.0,"Week":52.0,"Month":12.0,"Year":1.0,"Yr":1.0,"Annual":1.0,"Per Year":1.0,"Bi-Week":26.0,"Biweekly":26.0,"Day":260.0}
wage = pd.to_numeric(DF["prevailing_wage"], errors="coerce") * DF[uow_col].astype(str).str.strip().str.title().map(_factor)
# target
yy = (DF[TARGET_LBL].astype(str).str.strip().str.title()=="Certified").astype("int8")
# quartiles
bands = pd.qcut(wage, 4, labels=["Q1 (lowest)","Q2","Q3","Q4 (highest)"], duplicates="drop")
# group stats
g = (pd.DataFrame({"band": bands, "y": yy})
.dropna(subset=["band"])
.groupby("band", observed=True)["y"]
.agg(n="count", certified="sum", rate="mean")
.reset_index())
# Wilson CI
def _wilson(k,n,z=1.96):
if n==0: return (np.nan,np.nan)
p=k/n; d=1+z*z/n
c=(p+z*z/(2*n))/d
h=z*np.sqrt((p*(1-p)+z*z/(4*n))/n)/d
return (c-h, c+h)
cis = g.apply(lambda r: _wilson(int(r["certified"]), int(r["n"])), axis=1)
g["ci_low"] = [c[0] for c in cis]
g["ci_high"] = [c[1] for c in cis]
# lift vs baseline
overall = yy.mean()
g["rate_pct"] = (100*g["rate"]).round(1)
g["lift_pp"] = ((g["rate"] - overall)*100).round(1)
wage_quartile_tbl = g[["band","n","rate","rate_pct","ci_low","ci_high","lift_pp"]]
# visual
plt.figure(figsize=(6.8, 4))
x = np.arange(len(wage_quartile_tbl))
plt.errorbar(
wage_quartile_tbl["rate"].values,
x,
xerr=(wage_quartile_tbl["rate"]-wage_quartile_tbl["ci_low"]).values,
fmt="o", capsize=4, color="#007ACC", ecolor="#005A99", elinewidth=1.5, markeredgewidth=1.5
)
plt.yticks(x, wage_quartile_tbl["band"].astype(str).tolist())
plt.xlabel("Certification Rate")
plt.title("Visa Approval Rate by Wage Quartile (±95% CI)")
plt.xlim(0, 1)
plt.grid(axis="x", alpha=.25)
sns.despine()
plt.tight_layout()
plt.show()
# table
wage_quartile_tbl
| | band | n | rate | rate_pct | ci_low | ci_high | lift_pp |
|---|---|---|---|---|---|---|---|
| 0 | Q1 (lowest) | 6370 | 0.711617 | 71.2 | 0.700367 | 0.722612 | 4.4 |
| 1 | Q2 | 6370 | 0.696389 | 69.6 | 0.684982 | 0.707560 | 2.8 |
| 2 | Q3 | 6370 | 0.684772 | 68.5 | 0.673254 | 0.696068 | 1.7 |
| 3 | Q4 (highest) | 6370 | 0.578807 | 57.9 | 0.566638 | 0.590881 | -8.9 |
No. Against the previously observed baseline of 66.8%, the findings show:

- Q1 (lowest): ~71.2% (+4.4pp vs baseline)
- Q2: ~69.6% (+2.8pp)
- Q3: ~68.5% (+1.7pp)
- Q4 (highest): ~57.9% (−8.9pp)

Higher wages do not boost certification in this raw split: top-quartile roles actually sit below baseline while the lower and middle bands are modestly above it.
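The decline across quartiles can be tested directly with a Cochran–Armitage trend test. A self-contained sketch; the certified counts are reconstructed from the table above as round(n × rate), so they are approximations:

```python
import math

# Certified counts per wage quartile, reconstructed from the table above.
n = [6370, 6370, 6370, 6370]
k = [4533, 4436, 4362, 3687]      # ≈ round(n * rate) for Q1..Q4
scores = [0, 1, 2, 3]             # ordinal scores for the quartiles

N, K = sum(n), sum(k)
p_bar = K / N
# Cochran–Armitage statistic: T = Σ t_i (k_i − n_i·p̄),
# Var(T) = p̄(1−p̄)[Σ t_i² n_i − (Σ t_i n_i)² / N]
T = sum(t * (ki - ni * p_bar) for t, ki, ni in zip(scores, k, n))
var_T = p_bar * (1 - p_bar) * (
    sum(t * t * ni for t, ni in zip(scores, n))
    - sum(t * ni for t, ni in zip(scores, n)) ** 2 / N
)
z = T / math.sqrt(var_T)
print(f"trend z={z:.1f}")  # negative: approval falls as the wage quartile rises
```

A strongly negative z confirms the decline is a genuine trend across the bands, not just Q4 being an outlier band.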
Does the unit of prevailing wage (Hourly, Weekly, etc.) have any impact on the likelihood of visa application certification?¶
# Unit-of-wage → approval
uow_col = "_uow" if "_uow" in DF.columns else "unit_of_wage"
yy = (DF[TARGET_LBL].astype(str).str.strip().str.title()=="Certified").astype(int)
overall = yy.mean()
# normalize unit labels
uow = (DF[uow_col].astype("string").str.strip().str.title()
.replace({"Yr":"Year","Per Year":"Year","Annual":"Year","Bi-Week":"Biweekly"}))
def _wilson(k,n,z=1.96):
if n==0: return (np.nan,np.nan)
p=k/n; d=1+z*z/n
c=(p+z*z/(2*n))/d
h=z*np.sqrt((p*(1-p)+z*z/(4*n))/n)/d
return (c-h, c+h)
g = (pd.DataFrame({"unit": uow, "y": yy})
.dropna(subset=["unit"])
.groupby("unit", observed=True)["y"]
.agg(n="count", certified="sum", rate="mean")
.reset_index())
cis = g.apply(lambda r: _wilson(int(r["certified"]), int(r["n"])), axis=1)
g["ci_low"] = [c[0] for c in cis]
g["ci_high"] = [c[1] for c in cis]
g["rate_pct"] = (100*g["rate"]).round(1)
g["lift_pp"] = ((g["rate"] - overall)*100).round(1)
uow_tbl = g[["unit","n","rate","rate_pct","ci_low","ci_high","lift_pp"]] \
.sort_values("rate", ascending=False)
# Bar + error bars for unit-of-wage approval
_plot = uow_tbl.copy()
x = np.arange(len(_plot))
y = _plot["rate"].values
yerr = ( _plot["rate"] - _plot["ci_low"] ).values # symmetric enough here
plt.figure(figsize=(6.6, 4))
plt.bar(x, y, edgecolor="white")
plt.errorbar(x, y, yerr=yerr, fmt="none", capsize=3, color="k")
plt.xticks(x, _plot["unit"].astype(str).tolist())
plt.ylim(0, 1)
plt.ylabel("Certification rate")
plt.title("Approval rate by unit of wage (±95% CI)")
plt.grid(axis="y", alpha=.25)
plt.tight_layout()
plt.show()
display(uow_tbl)
| | unit | n | rate | rate_pct | ci_low | ci_high | lift_pp |
|---|---|---|---|---|---|---|---|
| 3 | Year | 22962 | 0.698850 | 69.9 | 0.692884 | 0.704750 | 3.1 |
| 2 | Week | 272 | 0.621324 | 62.1 | 0.562366 | 0.676901 | -4.7 |
| 1 | Month | 89 | 0.617978 | 61.8 | 0.514139 | 0.712052 | -5.0 |
| 0 | Hour | 2157 | 0.346314 | 34.6 | 0.326524 | 0.366651 | -32.2 |
Yes. Against the previously observed baseline of 66.8%, the findings show:

- Year: ~69.9% (+3.1pp vs baseline)
- Week: ~62.1% (−4.7pp)
- Month: ~61.8% (−5.0pp)
- Hour: ~34.6% (−32.2pp)

This is likely a unit/mix effect rather than causation, so the plan is to annualize wage for modeling and treat the unit as a diagnostic feature.
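As a preview of the annualization applied in the next section, the unit/mix problem disappears once every wage is expressed per year. A minimal sketch of the multiplier convention (2080 working hours, 52 weeks, 12 months per year; the sample figures are made up):

```python
import pandas as pd

# Unit → per-year multiplier convention used for annualization.
FACTOR = {"Hour": 2080.0, "Week": 52.0, "Month": 12.0, "Year": 1.0}

wages = pd.DataFrame({
    "prevailing_wage": [50.0, 2000.0, 8000.0, 96000.0],
    "unit_of_wage":    ["Hour", "Week", "Month", "Year"],
})
wages["wage_usd_annual"] = wages["prevailing_wage"] * wages["unit_of_wage"].map(FACTOR)
print(wages)
# e.g. $50/hour and $2,000/week both annualize to $104,000/year
```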
Data Pre-processing¶
Data Preparation for modeling + Outlier Check¶
# Data Pre-processing → Outlier Check
# Wage → annual USD from unit_of_wage / _uow
uow_col = "_uow" if "_uow" in DF.columns else "unit_of_wage"
def _unit_to_year_mult(u):
if pd.isna(u): return np.nan
s = str(u).strip().lower()
if s.startswith("hour") or s in {"hr","h","per hour"}: return 2080
if s.startswith("week") or s in {"wk","w","per week"}: return 52
if s.startswith("month") or s in {"mo","m","per month"}: return 12
if s.startswith("year") or s in {"yr","y","per year","annual"}: return 1
# fallback by first letter
return {"h":2080,"w":52,"m":12,"y":1}.get(s[:1], np.nan)
mult = DF[uow_col].map(_unit_to_year_mult)
DF["wage_usd_yr"] = pd.to_numeric(DF["prevailing_wage"], errors="coerce") * mult
DF["log_wage_usd_yr"] = np.log1p(DF["wage_usd_yr"])
# Employees fix negatives
DF["employees_raw"] = DF["no_of_employees"]
DF["employees_clip0"] = DF["no_of_employees"].clip(lower=0)
DF["log_employees"] = np.log1p(DF["employees_clip0"])
# Company age filing-year proxy 2016
FILING_YEAR_PROXY = 2016
DF["company_age"] = (FILING_YEAR_PROXY - DF["yr_of_estab"]).astype("float")
DF["company_age"] = DF["company_age"].where(DF["company_age"].between(0, 200))  # mask implausible ages as NaN
DF["log_company_age"] = np.log1p(DF["company_age"])
# IQR outlier summary (1.5×IQR fences)
def iqr_outlier_mask(x):
x = pd.to_numeric(x, errors="coerce")
q1, q3 = np.nanpercentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
return (x < lo) | (x > hi)
def _report(block_name, series_map):
rows = []
n = len(DF)
for label, s in series_map.items():
m = iqr_outlier_mask(s)
n_out = int(pd.Series(m).sum())
rows.append([block_name, label, n, n_out, round(100*n_out/n, 2)])
return pd.DataFrame(rows, columns=["feature","variant","n","n_outliers","pct_outliers"])
rep_wage = _report("wage", {
"prevailing_wage (raw, mixed units)": DF["prevailing_wage"],
"wage_usd_yr (normalized)": DF["wage_usd_yr"],
"log_wage_usd_yr": DF["log_wage_usd_yr"],
})
rep_emp = _report("employees", {
"no_of_employees (raw)": DF["employees_raw"],
"employees_clip0": DF["employees_clip0"],
"log_employees": DF["log_employees"],
})
rep_age = _report("company_age", {
"yr_of_estab (raw)": DF["yr_of_estab"],
"company_age": DF["company_age"],
"log_company_age": DF["log_company_age"],
})
OUTLIER_SUMMARY = pd.concat([rep_wage, rep_emp, rep_age], ignore_index=True)
print("=== OUTLIER SUMMARY (IQR, 1.5×IQR) ===")
display(OUTLIER_SUMMARY)
# Quick visuals boxplots: raw → normalized → log1p
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
# Wage
axes[0].boxplot(
[DF["prevailing_wage"].dropna(),
DF["wage_usd_yr"].dropna(),
DF["log_wage_usd_yr"].dropna()],
labels=["wage (raw)", "wage_usd_yr", "log(wage_usd_yr)"]
)
axes[0].set_title("Wage (raw → annualized → log1p)")
# Employees
axes[1].boxplot(
[DF["employees_raw"].dropna(),
DF["employees_clip0"].dropna(),
DF["log_employees"].dropna()],
labels=["raw", "clip0", "log1p"]
)
axes[1].set_title("Employees (raw → clip0 → log1p)")
# Company age
axes[2].boxplot(
[DF["yr_of_estab"].dropna(),
DF["company_age"].dropna(),
DF["log_company_age"].dropna()],
labels=["yr_of_estab", "age", "log1p(age)"]
)
axes[2].set_title("Company age (derivation & transform)")
plt.tight_layout()
plt.show()
# Notes for the next section
print("Notes:")
print("• Used log_wage_usd_yr for modeling; do not hard-drop wage outliers—log transform + possible winsorization is safer.")
print("• Keep employees ≥ 0 (Clipped); prefer log1p(employees_clip0) downstream.")
print("• Company age uses 2016 as a filing-year proxy; replace when a true filing year is available.")
=== OUTLIER SUMMARY (IQR, 1.5×IQR) ===
| | feature | variant | n | n_outliers | pct_outliers |
|---|---|---|---|---|---|
| 0 | wage | prevailing_wage (raw, mixed units) | 25480 | 427 | 1.68 |
| 1 | wage | wage_usd_yr (normalized) | 25480 | 2387 | 9.37 |
| 2 | wage | log_wage_usd_yr | 25480 | 2837 | 11.13 |
| 3 | employees | no_of_employees (raw) | 25480 | 1556 | 6.11 |
| 4 | employees | employees_clip0 | 25480 | 1556 | 6.11 |
| 5 | employees | log_employees | 25480 | 1915 | 7.52 |
| 6 | company_age | yr_of_estab (raw) | 25480 | 3260 | 12.79 |
| 7 | company_age | company_age | 25480 | 3196 | 12.54 |
| 8 | company_age | log_company_age | 25480 | 23 | 0.09 |
Notes:
• Used log_wage_usd_yr for modeling; do not hard-drop wage outliers—log transform + possible winsorization is safer.
• Keep employees ≥ 0 (Clipped); prefer log1p(employees_clip0) downstream.
• Company age uses 2016 as a filing-year proxy; replace when a true filing year is available.
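The summary table shows why the notes recommend log transforms over hard drops: on right-skewed data, most points flagged by the 1.5×IQR rule are simply the tail, and the flag count collapses after log1p. A self-contained sketch on synthetic lognormal data (parameters are illustrative, not fitted to this dataset):

```python
import numpy as np

def iqr_outlier_count(x):
    # 1.5×IQR fences — same rule as iqr_outlier_mask above
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))

rng = np.random.default_rng(72)
x = rng.lognormal(mean=11.0, sigma=0.8, size=10_000)  # heavy right tail, wage-like
raw_out = iqr_outlier_count(x)
log_out = iqr_outlier_count(np.log1p(x))
print(f"flagged raw: {raw_out}  flagged after log1p: {log_out}")
```

The same mechanism explains the table's company_age result: 12.5% flagged raw versus 0.09% after log1p.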
# Feature prep policy
# Config
WINSORIZE = True
TOP_P = 0.01
BOTTOM_P = 0.00
# Ensure they exist:
req_cols = [
"wage_usd_yr","log_wage_usd_yr",
"employees_clip0","log_employees",
"company_age","log_company_age"
]
missing_req = [c for c in req_cols if c not in DF.columns]
if missing_req:
raise KeyError(f"Missing expected columns from prior step: {missing_req}")
# Winsorization helper
def winsorize(s: pd.Series, lo=0.0, hi=0.01):
x = pd.to_numeric(s, errors="coerce")
lo_b = np.nanquantile(x, lo) if lo > 0 else None
hi_b = np.nanquantile(x, 1-hi) if hi > 0 else None
if lo_b is not None:
x = np.maximum(x, lo_b)
if hi_b is not None:
x = np.minimum(x, hi_b)
return pd.Series(x, index=s.index)
# Create winsorized copies
if WINSORIZE:
DF["wage_usd_yr_cap"] = winsorize(DF["wage_usd_yr"], lo=BOTTOM_P, hi=TOP_P)
DF["employees_clip0_cap"] = winsorize(DF["employees_clip0"], lo=0.0, hi=TOP_P)
else:
DF["wage_usd_yr_cap"] = DF["wage_usd_yr"]
DF["employees_clip0_cap"] = DF["employees_clip0"]
# Imputation global median
for col in ["wage_usd_yr_cap", "employees_clip0_cap", "company_age"]:
if DF[col].isna().any():
med = DF[col].median()
DF[col] = DF[col].fillna(med)
# Recompute logs if any imputation changed bases
DF["log_wage_usd_yr"] = np.log1p(DF["wage_usd_yr_cap"])
DF["log_employees"] = np.log1p(DF["employees_clip0_cap"])
DF["log_company_age"] = np.log1p(DF["company_age"])
# Final feature inventories numeric only
NUMERIC_AVAILABLE = [
# wage
"wage_usd_yr_cap", "log_wage_usd_yr",
# employees
"employees_clip0_cap", "log_employees",
# company age
"company_age", "log_company_age",
]
# Preferred sets by model family
FEATURES_FOR_LINEAR = [
# prefer logs for linear/logistic models
"log_wage_usd_yr", "log_employees", "log_company_age",
]
FEATURES_FOR_TREES = [
# keep both raw-ish and log as helpers
"wage_usd_yr_cap", "log_wage_usd_yr",
"employees_clip0_cap", "log_employees",
"company_age", "log_company_age",
]
# Sanity: no NaNs in chosen features
def _nan_report(cols):
rpt = DF[cols].isna().sum().sort_values(ascending=False)
has_nans = int(rpt.sum()) > 0
return has_nans, rpt
has_nans_lin, rpt_lin = _nan_report(FEATURES_FOR_LINEAR)
has_nans_tree, rpt_tree = _nan_report(FEATURES_FOR_TREES)
print("=== Feature Policy Summary ===")
print("Winsorization:", WINSORIZE, f"(top={TOP_P:.3f}, bottom={BOTTOM_P:.3f})")
print("\nNumeric features available:", NUMERIC_AVAILABLE)
print("\nLinear/Logistic preferred features:", FEATURES_FOR_LINEAR)
print("Tree/XGB features:", FEATURES_FOR_TREES)
if has_nans_lin or has_nans_tree:
print("\n[WARN] NaNs detected in selected features.")
print("\nNaN counts (Linear/Logistic):")
display(rpt_lin[rpt_lin>0])
print("\nNaN counts (Tree/XGB):")
display(rpt_tree[rpt_tree>0])
# Final safety imputer
DF[FEATURES_FOR_LINEAR] = DF[FEATURES_FOR_LINEAR].fillna(DF[FEATURES_FOR_LINEAR].median())
DF[FEATURES_FOR_TREES] = DF[FEATURES_FOR_TREES].fillna(DF[FEATURES_FOR_TREES].median())
else:
print("\n[OK] No NaNs in selected feature sets.")
# Documentation note
print("\nNote: company_age uses filing-year proxy 2016 (per dataset convention). Update if a true filing year is added.")
=== Feature Policy Summary ===
Winsorization: True (top=0.010, bottom=0.000)

Numeric features available: ['wage_usd_yr_cap', 'log_wage_usd_yr', 'employees_clip0_cap', 'log_employees', 'company_age', 'log_company_age']

Linear/Logistic preferred features: ['log_wage_usd_yr', 'log_employees', 'log_company_age']
Tree/XGB features: ['wage_usd_yr_cap', 'log_wage_usd_yr', 'employees_clip0_cap', 'log_employees', 'company_age', 'log_company_age']

[OK] No NaNs in selected feature sets.

Note: company_age uses filing-year proxy 2016 (per dataset convention). Update if a true filing year is added.
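To make the winsorization policy concrete: the helper caps values at the requested quantiles rather than dropping rows, so extremes are pulled in while the row count is preserved. A self-contained sketch with a toy series carrying one injected extreme value:

```python
import numpy as np
import pandas as pd

def winsorize(s: pd.Series, lo: float = 0.0, hi: float = 0.01) -> pd.Series:
    # Same policy as above: cap the bottom `lo` and top `hi` quantile tails.
    x = pd.to_numeric(s, errors="coerce")
    if lo > 0:
        x = np.maximum(x, np.nanquantile(x, lo))
    if hi > 0:
        x = np.minimum(x, np.nanquantile(x, 1 - hi))
    return pd.Series(x, index=s.index)

s = pd.Series(np.r_[np.arange(1.0, 100.0), 10_000.0])  # 1..99 plus one extreme value
capped = winsorize(s, lo=0.0, hi=0.01)
print(f"max before: {s.max():.0f}  max after: {capped.max():.2f}  rows kept: {len(capped)}")
```

Capping (rather than dropping) keeps the sample representative while blunting the leverage of a handful of extreme wages or employee counts.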
# FULL PREFLIGHT AUDIT
#Respect project seed
try:
RNG
except NameError:
RNG = 72
np.random.seed(RNG)
print("=== EASYVISA: END-TO-END PREPROCESSING AUDIT ===")
print(f"[INFO] RNG seed: {RNG}")
# Basic presence
RAW_REQ = {
"case_id","continent","education_of_employee","has_job_experience",
"requires_job_training","no_of_employees","yr_of_estab",
"region_of_employment","prevailing_wage","unit_of_wage",
"full_time_position","case_status"
}
CANON_REQ = {
"_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time",
"_case_label","y"
}
ENG_NUM_REQ = {
"wage_usd_yr", "wage_usd_yr_cap","log_wage_usd_yr",
"employees_clip0","employees_clip0_cap","log_employees",
"company_age","log_company_age"
}
CAT_ALL = ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"]
CAT_COLS = [c for c in CAT_ALL if c in DF.columns]
def _need(cols, name):
miss = sorted([c for c in cols if c not in DF.columns])
status = "OK" if not miss else f"MISSING → {miss}"
print(f"[CHECK] {name}: {status}")
if miss: raise AssertionError(f"{name} missing: {miss}")
_need(RAW_REQ, "Raw columns")
_need(CANON_REQ, "Canonical columns")
_need(ENG_NUM_REQ, "Engineered numeric columns")
print(f"[INFO] Shape: {len(DF):,} rows × {DF.shape[1]} cols\n")
# Target + class mix
vc_lbl = DF["_case_label"].value_counts()
vc_y = DF["y"].value_counts()
pos_rate = DF["y"].mean()
print("— Target balance —")
print(vc_lbl.to_string())
print(vc_y.to_string())
print(f"[OK] Positive rate (P(certified)) = {pos_rate:.4f}\n")
# EDA invariants verify-only
def _norm(s): return s.astype("string").str.strip().str.lower()
# a) Education monotonic: HS < Bach < Mast < Doc
edu_order = ["high school","bachelor's","master's","doctorate"]
edu_rates = (DF.assign(_edu=_norm(DF["_edu"]))
.groupby("_edu")["y"].mean()
.reindex(edu_order))
if edu_rates.isna().any():
print("[WARN] Education levels missing for monotonic check:", edu_rates[edu_rates.isna()].index.tolist())
else:
mono = np.all(np.diff(edu_rates.values) > 0)
print("— EDA invariant: Education monotonic (HS<Bach<Mast<Doc) —")
print(edu_rates.round(4).to_string())
print("[OK] Monotonic ↑ confirmed.\n" if mono else "[FAIL] Not monotonic!\n")
# b) Job experience: Yes > No
exp_rates = (DF.assign(_job_exp=_norm(DF["_job_exp"]))
.groupby("_job_exp")["y"].mean())
if {"y","n"}.issubset(exp_rates.index):
print("— EDA invariant: Job experience (Y > N) —")
print(exp_rates.round(4).to_string())
if not (exp_rates["y"] > exp_rates["n"]):
raise AssertionError("Job experience invariant failed.")
print("[OK] Experience improves approval.\n")
else:
print("[WARN] _job_exp not normalized to {'y','n'}; skipping.\n")
# c) Unit-of-wage broad pattern: Year > Hour
uow_rates = (DF.assign(_uow=_norm(DF["_uow"]))
.groupby("_uow")["y"].mean())
if {"year","hour"}.issubset(uow_rates.index):
print("— EDA invariant: Unit-of-wage (Year > Hour) —")
print(uow_rates.round(4).to_string())
if not (uow_rates["year"] > uow_rates["hour"]):
raise AssertionError("UoW invariant failed.")
print("[OK] Broad pattern holds.\n")
else:
print("[WARN] Missing 'year' or 'hour' in _uow; skipping.\n")
# Engineering policy audit
print("— Engineering policy —")
print("Annualization present: wage_usd_yr")
print("Log features present: log_wage_usd_yr, log_employees, log_company_age")
print("Clip/Winsor present: employees_clip0(+_cap), wage_usd_yr_cap")
print("Age proxy present: company_age (2016 proxy)\n")
# 3a) Annualization mapping sanity + impute count estimate from raw
def _map_mult(u):
if pd.isna(u): return np.nan
s = str(u).strip().lower()
if s.startswith("hour") or s in {"hr","h"}: return 2080.0
if s.startswith("week") or s in {"wk","w"}: return 52.0
if s.startswith("month") or s in {"mo","m"}: return 12.0
if s.startswith("year") or s in {"yr","y","annual"}: return 1.0
if s.startswith("bi"): return 26.0
if s.startswith("day"): return 260.0
return np.nan
annual_raw = pd.to_numeric(DF["prevailing_wage"], errors="coerce") * DF["unit_of_wage"].map(_map_mult)
annual_na = int(annual_raw.isna().sum())
cap_na = int(DF["wage_usd_yr_cap"].isna().sum())
print(f"[INFO] Annualization reconstruct: NaNs before policy ≈ {annual_na:,}")
print(f"[INFO] After policy (cap+impute): wage_usd_yr_cap NaNs = {cap_na}\n")
if cap_na != 0:
raise AssertionError("wage_usd_yr_cap still has NaNs.")
# 3b) Negatives clipped in employees
neg_emp = int(pd.to_numeric(DF["no_of_employees"], errors="coerce").lt(0).sum())
print(f"[INFO] Negative employees identified & clipped: {neg_emp:,}\n")
# 3c) Winsorization counts top caps
def _winsor_counts(base, capped):
base = pd.to_numeric(base, errors="coerce")
capped = pd.to_numeric(capped, errors="coerce")
mask_hi = capped < base # only top-capping used in our policy
return int(mask_hi.sum())
w_cap = _winsor_counts(DF["wage_usd_yr"], DF["wage_usd_yr_cap"])
e_cap = _winsor_counts(DF["employees_clip0"], DF["employees_clip0_cap"])
print("— Winsorization (top 1%) impact —")
print(f"wage_usd_yr: capped rows = {w_cap:,}")
print(f"employees_clip0: capped rows = {e_cap:,}\n")
# 3d) Imputation counts actually applied
def _impute_count(col_cap, base_na_series=None):
    """Rows imputed for `col_cap` (name kept for labeling only); counts NaNs in the pre-policy series."""
    if base_na_series is not None:
        return int(base_na_series.sum())
    # fallback (if the pre-policy series is not reconstructible): report 0, since the column has no NaNs now
    return 0
emp_base_na = pd.to_numeric(DF["no_of_employees"], errors="coerce").isna()
age_base_na = ~pd.to_numeric(DF["yr_of_estab"], errors="coerce").between(0, 9999) # coarse
imp_wage = _impute_count("wage_usd_yr_cap", annual_raw.isna())
imp_emp = _impute_count("employees_clip0_cap", emp_base_na)
imp_age = _impute_count("company_age", age_base_na)
print("— Median imputations —")
print(f"wage_usd_yr_cap: ~{imp_wage:,} rows filled")
print(f"employees_clip0_cap: ~{imp_emp:,} rows filled")
print(f"company_age: ~{imp_age:,} rows filled\n")
# 3e) Numeric diagnostics post-policy
def _num_diag(col):
s = pd.to_numeric(DF[col], errors="coerce")
q = s.quantile([.01,.25,.5,.75,.99]).round(3)
return {
"min": float(s.min()),
"p01": float(q.loc[.01]),
"p25": float(q.loc[.25]),
"p50": float(q.loc[.5]),
"p75": float(q.loc[.75]),
"p99": float(q.loc[.99]),
"max": float(s.max())
}
NUM_CHECKS = ["wage_usd_yr","wage_usd_yr_cap","log_wage_usd_yr",
"employees_clip0","employees_clip0_cap","log_employees",
"company_age","log_company_age"]
print("— Numeric feature ranges (post-policy) —")
for c in NUM_CHECKS:
d = _num_diag(c)
print(f"{c:>20}: min={d['min']:.3g} p01={d['p01']:.3g} p50={d['p50']:.3g} p99={d['p99']:.3g} max={d['max']:.3g}")
print()
# 4) Categorical cleaning + rare bucketing
print("— Categoricals (cleaned & rare-bucketed) —")
for c in CAT_COLS:
s = DF[c].astype("string")
nlev = int(s.nunique())
n_unknown = int((s == "unknown").sum())
n_other = int((s == f"{c}_other").sum())
top = s.value_counts().head(6).to_dict()
print(f"{c:>12}: levels={nlev} 'unknown'={n_unknown:,} '{c}_other'={n_other:,} top6={top}")
print()
# 5) Dedupe
if "case_id" in DF.columns:
dups = int(DF["case_id"].duplicated().sum())
print(f"— Dedupe —\nDuplicate case_id rows present: {dups}")
if dups != 0:
raise AssertionError("Duplicates remain on case_id.")
print("[OK] No duplicate IDs.\n")
# 6) Feature sets + NaN guards
try:
FEATURES_FOR_LINEAR
FEATURES_FOR_TREES
except NameError:
# Fall back to the policy we defined in prep
FEATURES_FOR_LINEAR = ["log_wage_usd_yr","log_employees","log_company_age"] + CAT_COLS
FEATURES_FOR_TREES = ["wage_usd_yr_cap","log_wage_usd_yr",
"employees_clip0_cap","log_employees",
"company_age","log_company_age"] + CAT_COLS
print("— Feature sets —")
print("Linear/Logistic:", FEATURES_FOR_LINEAR)
print("Tree/XGB: ", FEATURES_FOR_TREES)
def _nan_sum(cols): return int(DF[cols].isna().sum().sum())
nan_lin = _nan_sum(FEATURES_FOR_LINEAR)
nan_tree = _nan_sum(FEATURES_FOR_TREES)
print(f"[CHECK] NaNs in Linear set: {nan_lin} | NaNs in Tree set: {nan_tree}")
if nan_lin or nan_tree:
raise AssertionError("NaNs remain in selected features.")
print("[OK] No NaNs in selected feature sets.\n")
# 7) Final readiness
print("=== PREPROCESSING AUDIT: PASS ✓ — Ready for modeling ===")
=== EASYVISA: END-TO-END PREPROCESSING AUDIT ===
[INFO] RNG seed: 72
[CHECK] Raw columns: OK
[CHECK] Canonical columns: OK
[CHECK] Engineered numeric columns: OK
[INFO] Shape: 25,480 rows × 30 cols
— Target balance —
_case_label
Certified 17018
Denied 8462
y
1 17018
0 8462
[OK] Positive rate (P(certified)) = 0.6679
— EDA invariant: Education monotonic (HS<Bach<Mast<Doc) —
_edu
high school 0.3404
bachelor's 0.6221
master's 0.7863
doctorate 0.8723
[OK] Monotonic ↑ confirmed.
— EDA invariant: Job experience (Y > N) —
_job_exp
n 0.5613
y 0.7448
[OK] Experience improves approval.
— EDA invariant: Unit-of-wage (Year > Hour) —
_uow
hour 0.3463
month 0.618
week 0.6213
year 0.6989
[OK] Broad pattern holds.
— Engineering policy —
Annualization present: wage_usd_yr
Log features present: log_wage_usd_yr, log_employees, log_company_age
Clip/Winsor present: employees_clip0(+_cap), wage_usd_yr_cap
Age proxy present: company_age (2016 proxy)
[INFO] Annualization reconstruct: NaNs before policy ≈ 0
[INFO] After policy (cap+impute): wage_usd_yr_cap NaNs = 0
[INFO] Negative employees identified & clipped: 33
— Winsorization (top 1%) impact —
wage_usd_yr: capped rows = 255
employees_clip0: capped rows = 255
— Median imputations —
wage_usd_yr_cap: ~0 rows filled
employees_clip0_cap: ~0 rows filled
company_age: ~0 rows filled
— Numeric feature ranges (post-policy) —
wage_usd_yr: min=100 p01=2.63e+03 p50=8.28e+04 p99=2.04e+06 max=1.46e+07
wage_usd_yr_cap: min=100 p01=2.63e+03 p50=8.28e+04 p99=2.04e+06 max=2.04e+06
log_wage_usd_yr: min=4.62 p01=7.88 p50=11.3 p99=14.5 max=14.5
employees_clip0: min=0 p01=52 p50=2.11e+03 p99=1.03e+05 max=6.02e+05
employees_clip0_cap: min=0 p01=52 p50=2.11e+03 p99=1.03e+05 max=1.03e+05
log_employees: min=0 p01=3.97 p50=7.65 p99=11.5 max=11.5
company_age: min=0 p01=2 p50=19 p99=178 max=199
log_company_age: min=0 p01=1.1 p50=3 p99=5.19 max=5.3
— Categoricals (cleaned & rare-bucketed) —
_edu: levels=4 'unknown'=0 '_edu_other'=0 top6={"Bachelor's": 10234, "Master's": 9634, 'High School': 3420, 'Doctorate': 2192}
_continent: levels=6 'unknown'=0 '_continent_other'=0 top6={'Asia': 16861, 'Europe': 3732, 'North America': 3292, 'South America': 852, 'Africa': 551, 'Oceania': 192}
_region: levels=5 'unknown'=0 '_region_other'=0 top6={'Northeast': 7195, 'South': 7017, 'West': 6586, 'Midwest': 4307, 'Island': 375}
_uow: levels=4 'unknown'=0 '_uow_other'=0 top6={'Year': 22962, 'Hour': 2157, 'Week': 272, 'Month': 89}
_job_exp: levels=2 'unknown'=0 '_job_exp_other'=0 top6={'Y': 14802, 'N': 10678}
_job_train: levels=2 'unknown'=0 '_job_train_other'=0 top6={'N': 22525, 'Y': 2955}
_full_time: levels=2 'unknown'=0 '_full_time_other'=0 top6={'Y': 22773, 'N': 2707}
— Dedupe —
Duplicate case_id rows present: 0
[OK] No duplicate IDs.
— Feature sets —
Linear/Logistic: ['log_wage_usd_yr', 'log_employees', 'log_company_age']
Tree/XGB: ['wage_usd_yr_cap', 'log_wage_usd_yr', 'employees_clip0_cap', 'log_employees', 'company_age', 'log_company_age']
[CHECK] NaNs in Linear set: 0 | NaNs in Tree set: 0
[OK] No NaNs in selected feature sets.
=== PREPROCESSING AUDIT: PASS ✓ — Ready for modeling ===
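The annualization multipliers audited above (Hour→2080, Week→52, Month→12, Year→1) are easy to sanity-check in isolation. A minimal standalone sketch, using a hypothetical `annualize` helper rather than the notebook's `_map_mult`:

```python
import numpy as np
import pandas as pd

# Assumed multipliers: 2080 work-hours, 52 weeks, 12 months per year.
MULTIPLIERS = {"hour": 2080.0, "week": 52.0, "month": 12.0, "year": 1.0}

def annualize(wage, unit):
    """Convert a prevailing wage to annual USD; NaN when the unit is unrecognized."""
    m = MULTIPLIERS.get(str(unit).strip().lower(), np.nan)
    return float(wage) * m

wages = pd.DataFrame({"prevailing_wage": [50.0, 1200.0, 7000.0, 90000.0],
                      "unit_of_wage": ["Hour", "Week", "Month", "Year"]})
wages["wage_usd_yr"] = [annualize(w, u) for w, u in
                        zip(wages["prevailing_wage"], wages["unit_of_wage"])]
print(wages["wage_usd_yr"].tolist())  # [104000.0, 62400.0, 84000.0, 90000.0]
```

An unknown unit falls through to NaN, which mirrors the notebook policy of imputing such rows afterwards rather than silently guessing a multiplier.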
Model Building¶
Model Evaluation Criterion¶
- Primary metric: ROC-AUC on out-of-fold predictions, with PR-AUC as the secondary criterion.
- Rationale: the classes are moderately imbalanced (~67% Certified), so accuracy alone is misleading. ROC-AUC is threshold-free and measures ranking quality across all operating points, while PR-AUC focuses on the positive (Certified) class. Operating thresholds are chosen afterwards (Youden's J, or precision/recall targets) to match the business trade-off between wrongly certifying and wrongly denying applications.
First, let's create functions to compute the different metrics and the confusion matrix, so we don't have to repeat the same code for each model.
- The `model_performance_classification_sklearn` function will be used to check the performance of the models.
- The `confusion_matrix_sklearn` function will be used to plot the confusion matrix.
# Evaluation utilities for classification: threshold-free + thresholded metrics
from sklearn.metrics import (
roc_auc_score, average_precision_score, roc_curve, precision_recall_curve,
confusion_matrix, accuracy_score, precision_score, recall_score, f1_score,
brier_score_loss, log_loss
)
def choose_threshold(y_true, y_prob, *, method="youden", target=None):
"""
Optional helper to pick an operating threshold on validation/OOF probabilities.
method: 'youden' | 'f1' | 'precision_at' | 'recall_at'
- 'precision_at' or 'recall_at' require target in (0,1].
Returns (threshold, summary_dict).
"""
y_true = np.asarray(y_true).astype(int)
y_prob = np.asarray(y_prob).astype(float)
if method == "youden":
fpr, tpr, thr = roc_curve(y_true, y_prob)
j = tpr - fpr
t = float(thr[int(np.argmax(j))])
elif method == "f1":
grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
best, t = -1, 0.5
for g in grid:
pred = (y_prob >= g).astype(int)
f1 = f1_score(y_true, pred, zero_division=0)
if f1 > best:
best, t = f1, float(g)
    elif method == "precision_at":
        assert target is not None and 0 < target <= 1
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        thr = np.r_[thr, 1.0]  # pad so thr[i] pairs with prec[i]; the final (prec=1, rec=0) point is synthetic
        mask = prec >= target
        # lowest threshold that meets the precision target, which preserves the most recall
        t = float(thr[mask][0]) if mask.any() else 1.0
elif method == "recall_at":
assert target is not None and 0 < target <= 1
prec, rec, thr = precision_recall_curve(y_true, y_prob)
thr = np.r_[thr, 1.0]
mask = rec >= target
t = float(thr[mask][-1]) if mask.any() else 1.0
else:
raise ValueError("Unsupported method.")
pred = (y_prob >= t).astype(int)
return t, dict(
threshold=t,
precision=precision_score(y_true, pred, zero_division=0),
recall=recall_score(y_true, pred, zero_division=0),
f1=f1_score(y_true, pred, zero_division=0),
accuracy=accuracy_score(y_true, pred),
)
def model_performance_classification_sklearn(y_true, y_prob, *, threshold=0.5):
"""
Computes threshold-free metrics (ROC-AUC, PR-AUC, KS, Brier, LogLoss) and
thresholded metrics (Accuracy, Precision, Recall, F1) at `threshold`.
Returns (metrics_dict, y_pred_binary).
"""
y_true = np.asarray(y_true).astype(int)
y_prob = np.asarray(y_prob).astype(float)
y_pred = (y_prob >= threshold).astype(int)
# Threshold-free
rocAUC = roc_auc_score(y_true, y_prob)
prAUC = average_precision_score(y_true, y_prob) # PR-AUC/AP for positive class
fpr, tpr, _ = roc_curve(y_true, y_prob)
ks = float(np.max(tpr - fpr))
brier = brier_score_loss(y_true, y_prob)
ll = log_loss(y_true, np.c_[1 - y_prob, y_prob], labels=[0, 1])
# Thresholded
out = dict(
roc_auc=rocAUC, pr_auc=prAUC, ks=ks, brier=brier, log_loss=ll,
accuracy=accuracy_score(y_true, y_pred),
precision=precision_score(y_true, y_pred, zero_division=0),
recall=recall_score(y_true, y_pred, zero_division=0),
f1=f1_score(y_true, y_pred, zero_division=0),
threshold=float(threshold),
)
return out, y_pred
def confusion_matrix_sklearn(y_true, y_prob, *, threshold, labels=("Denied","Certified")):
"""
Plots a 2x2 confusion matrix at the chosen probability threshold.
Returns (cm_array, counts_dict).
"""
y_true = np.asarray(y_true).astype(int)
y_pred = (np.asarray(y_prob) >= float(threshold)).astype(int)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
fig, ax = plt.subplots(figsize=(4.6, 4))
ax.imshow(cm, interpolation="nearest")
ax.set_xticks([0, 1]); ax.set_yticks([0, 1])
ax.set_xticklabels(labels); ax.set_yticklabels(labels)
ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
ax.set_title(f"Confusion matrix @ threshold={threshold:.3f}")
for i in range(2):
for j in range(2):
ax.text(j, i, f"{cm[i, j]:,}", ha="center", va="center")
plt.tight_layout(); plt.show()
return cm, dict(tn=int(tn), fp=int(fp), fn=int(fn), tp=int(tp))
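Before wiring these helpers into the modeling pipeline, it can help to exercise the underlying sklearn calls on a tiny hand-made example. The sketch below inlines the same metric logic (ROC-AUC, KS, Youden's J) on ten invented probabilities; the data is purely illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, f1_score

# Ten hand-made probabilities: one negative (0.55) outranks one positive (0.45)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

roc = roc_auc_score(y_true, y_prob)        # 24 of 25 pos/neg pairs ranked correctly -> 0.96
fpr, tpr, thr = roc_curve(y_true, y_prob)
ks = float(np.max(tpr - fpr))              # KS = max vertical gap between TPR and FPR
t = float(thr[int(np.argmax(tpr - fpr))])  # Youden's J threshold, as in choose_threshold
pred = (y_prob >= t).astype(int)
f1 = f1_score(y_true, pred)
print(f"ROC-AUC={roc:.2f} KS={ks:.2f} threshold={t:.2f} F1={f1:.3f}")
```

Here Youden's J lands on the threshold 0.65, which catches four of the five positives with zero false positives; a single positive scored at 0.45 is missed.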
# Build pipelines if missing + OOF evaluation
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
# Guards use existing objects
assert "RNG" in globals(), "RNG must be defined in your config cell."
assert "DF" in globals(), "DF not found."
assert "FEATURES_FOR_LINEAR" in globals() and "FEATURES_FOR_TREES" in globals(), "Feature lists missing."
CAT_COLS = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF.columns]
NUM_LINEAR = [c for c in FEATURES_FOR_LINEAR if c in DF.columns and c not in CAT_COLS]
NUM_TREES = [c for c in FEATURES_FOR_TREES if c in DF.columns and c not in CAT_COLS]
TARGET = "y"
y = DF[TARGET].astype(int).to_numpy()
# Define metrics helpers
if "choose_threshold" not in globals():
def choose_threshold(y_true, y_prob, *, method="youden", target=None):
y_true = np.asarray(y_true).astype(int)
y_prob = np.asarray(y_prob).astype(float)
fpr, tpr, thr_roc = roc_curve(y_true, y_prob)
if method == "youden":
j = tpr - fpr
thr = float(thr_roc[int(np.argmax(j))])
elif method == "f1":
grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
best, thr = -1, 0.5
for t in grid:
yp = (y_prob >= t).astype(int)
f1 = f1_score(y_true, yp, zero_division=0)
if f1 > best: best, thr = f1, float(t)
        elif method == "precision_at":
            assert target is not None and 0 < target <= 1
            prec, rec, thr_pr = precision_recall_curve(y_true, y_prob)
            thr_pr = np.r_[thr_pr, 1.0]  # pad so thr_pr[i] pairs with prec[i]
            mask = prec >= target
            # lowest threshold that meets the precision target (preserves recall)
            thr = float(thr_pr[mask][0]) if mask.any() else 1.0
elif method == "recall_at":
assert target is not None and 0 < target <= 1
prec, rec, thr_pr = precision_recall_curve(y_true, y_prob)
thr_pr = np.r_[thr_pr, 1.0]
mask = rec >= target
thr = float(thr_pr[mask][-1]) if mask.any() else 1.0
else:
raise ValueError("Unsupported method.")
yp = (y_prob >= thr).astype(int)
return thr, dict(
threshold=thr,
precision=precision_score(y_true, yp, zero_division=0),
recall=recall_score(y_true, yp, zero_division=0),
f1=f1_score(y_true, yp, zero_division=0),
accuracy=accuracy_score(y_true, yp),
)
if "model_performance_classification_sklearn" not in globals():
def model_performance_classification_sklearn(y_true, y_prob, *, threshold=0.5):
y_true = np.asarray(y_true).astype(int)
y_prob = np.asarray(y_prob).astype(float)
y_pred = (y_prob >= threshold).astype(int)
rocAUC = roc_auc_score(y_true, y_prob)
prAUC = average_precision_score(y_true, y_prob)
fpr, tpr, _ = roc_curve(y_true, y_prob)
ks = float(np.max(tpr - fpr))
brier = brier_score_loss(y_true, y_prob)
ll = log_loss(y_true, np.c_[1 - y_prob, y_prob], labels=[0,1])
return dict(
roc_auc=rocAUC, pr_auc=prAUC, ks=ks, brier=brier, log_loss=ll,
accuracy=accuracy_score(y_true, y_pred),
precision=precision_score(y_true, y_pred, zero_division=0),
recall=recall_score(y_true, y_pred, zero_division=0),
f1=f1_score(y_true, y_pred, zero_division=0),
threshold=threshold
), y_pred
def confusion_matrix_sklearn(y_true, y_prob, *, threshold, labels=("Denied","Certified")):
y_pred = (np.asarray(y_prob) >= float(threshold)).astype(int)
cm = confusion_matrix(np.asarray(y_true).astype(int), y_pred, labels=[0,1])
fig, ax = plt.subplots(figsize=(4.6,4))
ax.imshow(cm, interpolation="nearest", cmap="Pastel1_r")
ax.set_xticks([0,1]); ax.set_yticks([0,1])
ax.set_xticklabels(labels); ax.set_yticklabels(labels)
ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
ax.set_title(f"Confusion matrix @ threshold={threshold:.3f}")
for i in range(2):
for j in range(2):
ax.text(j, i, f"{cm[i,j]:,}", ha="center", va="center")
plt.tight_layout(); plt.show()
return cm
# Build preprocessors/pipelines ONLY if missing
if "pipe_logit" not in globals():
pp_linear = ColumnTransformer(
[("num", StandardScaler(), NUM_LINEAR),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), CAT_COLS)],
remainder="drop", verbose_feature_names_out=False
)
pipe_logit = Pipeline([("prep", pp_linear),
("model", LogisticRegression(max_iter=2000, solver="lbfgs", class_weight="balanced"))])
if "pipe_forest" not in globals():
pp_trees = ColumnTransformer(
[("num", "passthrough", NUM_TREES),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), CAT_COLS)],
remainder="drop", verbose_feature_names_out=False
)
pipe_forest = Pipeline([("prep", pp_trees),
("model", RandomForestClassifier(n_estimators=600, n_jobs=-1,
random_state=RNG,
class_weight="balanced_subsample"))])
MODELS = {
"Logit (logs+OHE+cw)": pipe_logit,
"RF (raw+log+OHE+cw)": pipe_forest,
}
# OOF evaluation: ROC-AUC primary, PR-AUC secondary
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RNG)
def oof_eval(name, pipe):
oof = np.zeros_like(y, dtype=float)
for tr, te in skf.split(DF, y):
pipe.fit(DF.iloc[tr], y[tr])
oof[te] = pipe.predict_proba(DF.iloc[te])[:, 1]
thr, thr_summ = choose_threshold(y, oof, method="youden")
perf, _ = model_performance_classification_sklearn(y, oof, threshold=thr)
row = dict(model=name, **perf, **{f"thr_{k}":v for k,v in thr_summ.items()})
return row, oof, thr
rows, oofs, thrs = [], {}, {}
for name, pipe in MODELS.items():
row, oof, thr = oof_eval(name, pipe)
rows.append(row); oofs[name] = oof; thrs[name] = thr
eval_tbl = (pd.DataFrame(rows)
.loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss",
"accuracy","precision","recall","f1","threshold"]]
.sort_values(["roc_auc","pr_auc"], ascending=False)
.reset_index(drop=True)
.round(4))
print("=== Model Selection (OOF) — ROC-AUC primary, PR-AUC secondary ===")
display(eval_tbl)
# Confusion matrices
for name in MODELS.keys():
print(f"\n--- {name} — confusion matrix @ thr={thrs[name]:.3f} ---")
confusion_matrix_sklearn(y, oofs[name], threshold=thrs[name], labels=("Denied","Certified"))
=== Model Selection (OOF) — ROC-AUC primary, PR-AUC secondary ===
|   | model | roc_auc | pr_auc | ks | brier | log_loss | accuracy | precision | recall | f1 | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logit (logs+OHE+cw) | 0.7701 | 0.8649 | 0.3978 | 0.1953 | 0.5785 | 0.6901 | 0.8311 | 0.6728 | 0.7436 | 0.5234 |
| 1 | RF (raw+log+OHE+cw) | 0.7607 | 0.8628 | 0.3842 | 0.1820 | 0.5404 | 0.6969 | 0.8151 | 0.7064 | 0.7569 | 0.6633 |
--- Logit (logs+OHE+cw) — confusion matrix @ thr=0.523 ---
--- RF (raw+log+OHE+cw) — confusion matrix @ thr=0.663 ---
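The out-of-fold (OOF) scheme behind this table generalizes beyond these two models: every row is scored by a model that never saw it, so the metrics are honest estimates. A self-contained sketch of the same pattern on synthetic data (names and data are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=6, random_state=72)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=72)

oof = np.zeros(len(y))
for tr, te in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    oof[te] = model.predict_proba(X[te])[:, 1]  # each row scored by a model fit without it

print(f"OOF ROC-AUC: {roc_auc_score(y, oof):.3f}")
```

Because each fold's model is refit from scratch, the concatenated OOF vector behaves like predictions on unseen data, which is why the notebook chooses its thresholds on it.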
# Visualize BASELINES (Original): Logistic + RandomForest
# - Logistic: top coefficients (±) + odds ratios
# - RandomForest: one tree (depth=3 view) + rules + importances
from sklearn.tree import plot_tree, export_text
# 0) Pull fitted pipelines & feature names
def _extract_estimator_from_calibrator(cal):
"""Return the fitted Pipeline(prep+model) from a CalibratedClassifierCV across sklearn versions."""
# Preferred: inside calibrated_classifiers_[0]
if hasattr(cal, "calibrated_classifiers_") and cal.calibrated_classifiers_:
cc = cal.calibrated_classifiers_[0]
for attr in ("estimator", "estimator_", "base_estimator", "classifier", "clf"):
est = getattr(cc, attr, None)
if isinstance(est, Pipeline):
return est
# Fallbacks sometimes exposed on the CalibratedClassifierCV itself
for attr in ("base_estimator_", "base_estimator", "estimator"):
est = getattr(cal, attr, None)
if isinstance(est, Pipeline):
return est
return None # let caller decide how to rebuild
# Use the fitted baseline pipelines directly; _extract_estimator_from_calibrator above
# is kept as a fallback for pulling pipelines out of CalibratedClassifierCV objects.
logit_pipe = pipe_logit
rf_pipe = pipe_forest
# If not found (older/newer sklearn), rebuild minimal equivalents for VIS ONLY
if logit_pipe is None or rf_pipe is None:
# Rebuild the same preprocessors used for baselines
CAT_COLS = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR.columns]
NUM_LINEAR = [c for c in ["log_wage_usd_yr","log_employees","log_company_age"] if c in DF_TR.columns]
NUM_TREES = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap",
"log_employees","company_age","log_company_age"] if c in DF_TR.columns]
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
pp_linear = ColumnTransformer(
[("num", StandardScaler(), NUM_LINEAR),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), CAT_COLS)],
remainder="drop", verbose_feature_names_out=False
)
pp_trees = ColumnTransformer(
[("num", "passthrough", NUM_TREES),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT_COLS)],
remainder="drop", verbose_feature_names_out=False
)
if logit_pipe is None:
logit_pipe = Pipeline([("prep", pp_linear),
("model", LogisticRegression(max_iter=2000, solver="lbfgs",
class_weight="balanced"))]).fit(DF_TR, y_tr)
if rf_pipe is None:
rf_pipe = Pipeline([("prep", pp_trees),
("model", RandomForestClassifier(n_estimators=600, random_state=RNG,
n_jobs=-1, class_weight="balanced_subsample"))]).fit(DF_TR, y_tr)
# Now pull transformers/models/feature names
logit_prep = logit_pipe.named_steps["prep"]
logit_model = logit_pipe.named_steps["model"]
logit_feats = logit_prep.get_feature_names_out()
rf_prep = rf_pipe.named_steps["prep"]
rf_model = rf_pipe.named_steps["model"]
rf_feats = rf_prep.get_feature_names_out()
# 1) Logistic: coefficients & odds ratios
coef = logit_model.coef_.ravel()
dfc = (pd.DataFrame({"feature": logit_feats, "coef": coef})
.assign(abs_coef=lambda d: d["coef"].abs(),
odds_ratio=lambda d: np.exp(d["coef"]))
.sort_values("abs_coef", ascending=False))
topk = 15
top_pos = dfc.sort_values("coef", ascending=False).head(topk)
top_neg = dfc.sort_values("coef", ascending=True).head(topk)
viz = pd.concat([top_neg, top_pos])
plt.figure(figsize=(8, 8))
colors = np.where(viz["coef"]>=0, "#2B83F6", "#FF8C42")
plt.barh(viz["feature"], viz["coef"], color=colors)
plt.gca().invert_yaxis()
plt.title("Logistic Regression — Top ± coefficients")
plt.xlabel("Coefficient (log-odds; numeric features standardized)")
plt.tight_layout()
plt.show()
print("=== Logistic — Top features with odds ratios ===")
display(viz.loc[:, ["feature","coef","odds_ratio"]].round(4))
# 2) RandomForest: visualize one tree (depth=3) + rules
tree_idx = 0
tree0 = rf_model.estimators_[tree_idx]
plt.figure(figsize=(18, 10))
plot_tree(
tree0,
feature_names=rf_feats,
class_names=["Denied","Certified"],
filled=True, rounded=True, proportion=True, fontsize=7,
max_depth=3
)
plt.title(f"RandomForest — Tree #{tree_idx} (truncated view, max_depth=3)")
plt.tight_layout()
plt.show()
print(f"=== RandomForest — Rules for Tree #{tree_idx} (max_depth=3) ===")
print(export_text(tree0, feature_names=list(rf_feats), max_depth=3))
# 3) RandomForest: feature importances (top 20)
imp = pd.Series(rf_model.feature_importances_, index=rf_feats).sort_values(ascending=False).head(20)
plt.figure(figsize=(8, 6))
plt.barh(imp.index, imp.values)
plt.gca().invert_yaxis()
plt.title("RandomForest — Top feature importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
=== Logistic — Top features with odds ratios ===
|   | feature | coef | odds_ratio |
|---|---|---|---|
| 5 | _edu_High School | -1.6565 | 0.1908 |
| 18 | _uow_Hour | -0.7447 | 0.4749 |
| 22 | _job_exp_N | -0.5450 | 0.5798 |
| 17 | _region_West | -0.3890 | 0.6777 |
| 12 | _continent_South America | -0.3827 | 0.6820 |
| 13 | _region_Island | -0.3278 | 0.7205 |
| 10 | _continent_North America | -0.2656 | 0.7668 |
| 3 | _edu_Bachelor's | -0.2298 | 0.7947 |
| 26 | _full_time_N | -0.2057 | 0.8141 |
| 15 | _region_Northeast | -0.1928 | 0.8247 |
| 8 | _continent_Asia | -0.1501 | 0.8606 |
| 24 | _job_train_N | -0.1043 | 0.9010 |
| 11 | _continent_Oceania | -0.0742 | 0.9284 |
| 25 | _job_train_Y | -0.0479 | 0.9533 |
| 19 | _uow_Month | -0.0409 | 0.9600 |
| 4 | _edu_Doctorate | 1.0944 | 2.9873 |
| 9 | _continent_Europe | 0.6457 | 1.9072 |
| 6 | _edu_Master's | 0.6398 | 1.8961 |
| 14 | _region_Midwest | 0.5303 | 1.6994 |
| 21 | _uow_Year | 0.5295 | 1.6981 |
| 23 | _job_exp_Y | 0.3929 | 1.4813 |
| 16 | _region_South | 0.2271 | 1.2550 |
| 20 | _uow_Week | 0.1039 | 1.1095 |
| 7 | _continent_Africa | 0.0748 | 1.0777 |
| 27 | _full_time_Y | 0.0535 | 1.0550 |
| 1 | log_employees | 0.0357 | 1.0364 |
| 0 | log_wage_usd_yr | 0.0011 | 1.0011 |
| 2 | log_company_age | -0.0110 | 0.9891 |
| 19 | _uow_Month | -0.0409 | 0.9600 |
| 25 | _job_train_Y | -0.0479 | 0.9533 |
=== RandomForest — Rules for Tree #0 (max_depth=3) ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 250637.73
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 39
|   |   |   |--- _job_exp_Y > 0.50
|   |   |   |   |--- truncated branch of depth 33
|   |   |--- _region_West > 0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 34
|   |   |   |--- _uow_Hour > 0.50
|   |   |   |   |--- truncated branch of depth 11
|   |--- wage_usd_yr_cap > 250637.73
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- wage_usd_yr_cap <= 629553.50
|   |   |   |   |--- truncated branch of depth 23
|   |   |   |--- wage_usd_yr_cap > 629553.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y > 0.50
|   |   |   |--- log_wage_usd_yr <= 14.39
|   |   |   |   |--- truncated branch of depth 17
|   |   |   |--- log_wage_usd_yr > 14.39
|   |   |   |   |--- truncated branch of depth 14
|--- _edu_Doctorate > 0.50
|   |--- _region_South <= 0.50
|   |   |--- log_wage_usd_yr <= 12.32
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _job_exp_Y > 0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |--- log_wage_usd_yr > 12.32
|   |   |   |--- log_employees <= 7.40
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_employees > 7.40
|   |   |   |   |--- truncated branch of depth 10
|   |--- _region_South > 0.50
|   |   |--- employees_clip0_cap <= 9181.50
|   |   |   |--- log_employees <= 8.33
|   |   |   |   |--- truncated branch of depth 14
|   |   |   |--- log_employees > 8.33
|   |   |   |   |--- class: 1.0
|   |   |--- employees_clip0_cap > 9181.50
|   |   |   |--- log_employees <= 11.01
|   |   |   |   |--- truncated branch of depth 5
|   |   |   |--- log_employees > 11.01
|   |   |   |   |--- class: 1.0
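A coefficient in the logistic table converts to an odds ratio via exponentiation; the arithmetic below reproduces the High School row as a worked example (the 0.6679 base rate comes from the audit output):

```python
import math

coef_hs = -1.6565                      # _edu_High School coefficient from the table above
odds_ratio = math.exp(coef_hs)         # ~0.1908: odds of certification shrink by ~81%
print(round(odds_ratio, 4))

# Shifting the base-rate log-odds by this coefficient gives the implied probability
base_logodds = math.log(0.6679 / (1 - 0.6679))        # P(certified) = 0.6679 overall
p_hs = 1 / (1 + math.exp(-(base_logodds + coef_hs)))  # all other features at baseline
print(round(p_hs, 4))
```

This matches the EDA invariant printed earlier, where high school applicants had roughly a 0.34 approval rate versus the 0.67 overall rate (the gap reflects that the other features are not literally at baseline).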
# Freeze final choice per criterion
if "BEST_MODEL_NAME" not in globals():
BEST_MODEL_NAME = "Logit (logs+OHE+cw)"
if "BEST_PIPE" not in globals():
BEST_PIPE = pipe_logit
if "BEST_THRESHOLD" not in globals():
BEST_THRESHOLD = thrs[BEST_MODEL_NAME]
if "BEST_OOF_PROB" not in globals():
BEST_OOF_PROB = oofs[BEST_MODEL_NAME]
y_true = DF["y"].astype(int).to_numpy()
# Robust threshold picker
def pick_threshold(y_true, y_prob, *, method="youden", target=None):
y_true = np.asarray(y_true).astype(int)
y_prob = np.asarray(y_prob).astype(float)
if method == "youden":
fpr, tpr, thr = roc_curve(y_true, y_prob)
j = tpr - fpr
t = float(thr[int(np.argmax(j))])
elif method == "f1":
grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
f1s = [f1_score(y_true, (y_prob >= g).astype(int), zero_division=0) for g in grid]
t = float(grid[int(np.argmax(f1s))])
elif method in {"precision_at", "recall_at"}:
assert target is not None and 0 < target <= 1, "target must be in (0,1]"
prec, rec, thr = precision_recall_curve(y_true, y_prob)
        # NOTE: `thr` has length len(prec)-1; thresholds[i] aligns with prec[i] and rec[i].
        # The final (prec=1, rec=0) point has no threshold, so drop it.
        prec_v, rec_v, thr_v = prec[:-1], rec[:-1], thr
if method == "precision_at":
ok = np.where(prec_v >= target)[0]
else: # recall_at
ok = np.where(rec_v >= target)[0]
if ok.size:
# choose the HIGHEST threshold that still meets the target
t = float(thr_v[ok].max())
else:
# fallback: use F1-optimal if target is unattainable
grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
f1s = [f1_score(y_true, (y_prob >= g).astype(int), zero_division=0) for g in grid]
t = float(grid[int(np.argmax(f1s))])
else:
raise ValueError("Unsupported method.")
return t
# 2) Compute operating points
t_you = float(BEST_THRESHOLD) # default (Youden from OOF)
t80 = pick_threshold(y_true, BEST_OOF_PROB, method="precision_at", target=0.80)
t90 = pick_threshold(y_true, BEST_OOF_PROB, method="recall_at", target=0.90)
# 3) Metrics table at each threshold
def eval_at(th):
perf, _ = model_performance_classification_sklearn(y_true, BEST_OOF_PROB, threshold=th)
return pd.Series(perf).round(4)
tbl = pd.DataFrame(
{"label": ["Youden (default)", "Precision≥0.80", "Recall≥0.90"],
"threshold": [t_you, t80, t90]}
)
report = pd.concat([tbl, tbl["threshold"].apply(eval_at)], axis=1)
print(f"=== {BEST_MODEL_NAME} — operating points ===")
display(report)
# 4) Confusion matrices
print("\nConfusion @ Youden:"); confusion_matrix_sklearn(y_true, BEST_OOF_PROB, threshold=t_you, labels=("Denied","Certified"))
print("\nConfusion @ Prec≥0.80:"); confusion_matrix_sklearn(y_true, BEST_OOF_PROB, threshold=t80, labels=("Denied","Certified"))
print("\nConfusion @ Rec≥0.90:"); confusion_matrix_sklearn(y_true, BEST_OOF_PROB, threshold=t90, labels=("Denied","Certified"))
=== Logit (logs+OHE+cw) — operating points ===
|   | label | threshold | roc_auc | pr_auc | ks | brier | log_loss | accuracy | precision | recall | f1 | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Youden (default) | 0.523426 | 0.7701 | 0.8649 | 0.3978 | 0.1953 | 0.5785 | 0.6901 | 0.8311 | 0.6728 | 0.7436 | 0.5234 |
| 1 | Precision≥0.80 | 0.958036 | 0.7701 | 0.8649 | 0.3978 | 0.1953 | 0.5785 | 0.3321 | 1.0000 | 0.0001 | 0.0001 | 0.9580 |
| 2 | Recall≥0.90 | 0.300785 | 0.7701 | 0.8649 | 0.3978 | 0.1953 | 0.5785 | 0.7343 | 0.7513 | 0.9001 | 0.8190 | 0.3008 |
Confusion @ Youden:
Confusion @ Prec≥0.80:
Confusion @ Rec≥0.90:
array([[ 3392, 5070],
[ 1700, 15318]])
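The pattern in the operating-points table — raising the threshold trades recall for precision — can be demonstrated on a toy score vector (the probabilities below are invented, not the model's):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95])

results = {}
for thr in (0.30, 0.50, 0.90):
    pred = (y_prob >= thr).astype(int)  # higher threshold -> fewer, safer positives
    results[thr] = (precision_score(y_true, pred, zero_division=0),
                    recall_score(y_true, pred, zero_division=0))
    print(f"thr={thr:.2f} precision={results[thr][0]:.3f} recall={results[thr][1]:.3f}")
```

As the threshold rises from 0.30 to 0.90, precision climbs from 0.75 to 1.0 while recall collapses from 1.0 to one third, which is the same shape seen between the Recall≥0.90 and Precision≥0.80 rows above.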
# Use the fitted RandomForest pipeline (rf_pipe) and model (rf_model)
rf_model = rf_pipe.named_steps["model"]
rf_feats = rf_pipe.named_steps["prep"].get_feature_names_out()
# Thresholds for each operating point
thr_you = t_you
thr_prec = t80
thr_rec = t90
for label, thr in [("Youden (default)", thr_you), ("Precision≥0.80", thr_prec), ("Recall≥0.90", thr_rec)]:
print(f"\n=== RandomForest Visualization @ {label} (threshold={thr:.4f}) ===")
# 1) Visualize one tree (depth=3 view)
tree_idx = 0
tree0 = rf_model.estimators_[tree_idx]
plt.figure(figsize=(18, 10))
plot_tree(
tree0,
feature_names=rf_feats,
class_names=["Denied", "Certified"],
filled=True, rounded=True, proportion=True, fontsize=7,
max_depth=3
)
plt.title(f"RandomForest — Tree #{tree_idx} (max_depth=3) @ {label}")
plt.tight_layout()
plt.show()
# 2) Print rules for the tree (max_depth=3)
print(f"=== RandomForest — Rules for Tree #{tree_idx} (max_depth=3) @ {label} ===")
print(export_text(tree0, feature_names=list(rf_feats), max_depth=3))
# 3) Feature importances (top 20)
imp = pd.Series(rf_model.feature_importances_, index=rf_feats).sort_values(ascending=False).head(20)
plt.figure(figsize=(8, 6))
plt.barh(imp.index, imp.values)
plt.gca().invert_yaxis()
plt.title(f"RandomForest — Top feature importances @ {label}")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
=== RandomForest Visualization @ Youden (default) (threshold=0.5234) ===
=== RandomForest — Rules for Tree #0 (max_depth=3) @ Youden (default) ===
=== RandomForest Visualization @ Precision≥0.80 (threshold=0.9580) ===
=== RandomForest — Rules for Tree #0 (max_depth=3) @ Precision≥0.80 ===
=== RandomForest Visualization @ Recall≥0.90 (threshold=0.3008) ===
=== RandomForest — Rules for Tree #0 (max_depth=3) @ Recall≥0.90 === (rules identical to the Youden printout above; the tree structure does not change with the operating threshold)
# A) Train/Hold-out, calibration, policy threshold, hold-out evaluation
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_curve, precision_recall_curve
assert "RNG" in globals() and "DF" in globals(), "Need RNG + DF in memory"
assert "BEST_PIPE" in globals() and "BEST_MODEL_NAME" in globals(), "Run selection first"
# 1) Stratified hold-out split (single final test set)
idx = np.arange(len(DF))
idx_tr, idx_ho = train_test_split(idx, test_size=0.20, stratify=DF["y"], random_state=RNG)
DF_TR, DF_HO = DF.iloc[idx_tr].copy(), DF.iloc[idx_ho].copy()
y_tr, y_ho = DF_TR["y"].astype(int).to_numpy(), DF_HO["y"].astype(int).to_numpy()
print(f"[Split] Train={len(DF_TR):,} Hold-out={len(DF_HO):,}")
# 2) Fit calibrated model on TRAIN (sigmoid/Platt; 5-fold inside train only)
cal = CalibratedClassifierCV(BEST_PIPE, method="sigmoid", cv=5)
cal.fit(DF_TR, y_tr)
# 3) Production threshold policy (choose one)
POLICY = "precision_floor" # options: "precision_floor", "youden", "recall_floor"
PRECISION_FLOOR = 0.80 # only used if POLICY == "precision_floor"
RECALL_FLOOR = 0.90 # only used if POLICY == "recall_floor"
# Threshold selection on TRAIN: averaged predict_proba from the 5 calibrated folds (mostly in-sample, not strictly OOF)
proba_tr = cal.predict_proba(DF_TR)[:, 1]
def pick_threshold(y_true, y_prob, *, method="youden", target=None):
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob).astype(float)
    if method == "youden":
        fpr, tpr, thr = roc_curve(y_true, y_prob)
        return float(thr[int(np.argmax(tpr - fpr))])
    elif method == "precision_at":
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        prec_v, thr_v = prec[:-1], thr  # precision[i] pairs with thresholds[i]
        ok = np.where(prec_v >= float(target))[0]
        if ok.size:
            # best F1 among feasible thresholds (robust)
            cands = thr_v[ok]
            from sklearn.metrics import f1_score
            f1s = [f1_score(y_true, (y_prob >= t0).astype(int), zero_division=0) for t0 in cands]
            return float(cands[int(np.argmax(f1s))])
        # fallback: Youden if the floor is unattainable
        fpr, tpr, thr = roc_curve(y_true, y_prob)
        return float(thr[int(np.argmax(tpr - fpr))])
    elif method == "recall_at":
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        rec_v, thr_v = rec[:-1], thr  # recall[i] pairs with thresholds[i]
        ok = np.where(rec_v >= float(target))[0]
        if ok.size:
            # highest threshold that still satisfies the recall floor
            return float(thr_v[ok].max())
        # fallback: minimal threshold
        return float(thr_v.min() if thr_v.size else 0.0)
    else:
        raise ValueError("Unsupported method")
if POLICY == "precision_floor":
    THR_PROD = pick_threshold(y_tr, proba_tr, method="precision_at", target=PRECISION_FLOOR)
elif POLICY == "recall_floor":
    THR_PROD = pick_threshold(y_tr, proba_tr, method="recall_at", target=RECALL_FLOOR)
else:
    THR_PROD = pick_threshold(y_tr, proba_tr, method="youden")
print(f"[Policy] {POLICY} → production threshold = {THR_PROD:.3f}")
# 4) Hold-out evaluation with calibrated probabilities
proba_ho = cal.predict_proba(DF_HO)[:, 1]
perf_ho, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=THR_PROD)
print("\n=== HOLD-OUT METRICS (calibrated) ===")
display(pd.Series(perf_ho).round(4))
print("\nConfusion @ production threshold:")
confusion_matrix_sklearn(y_ho, proba_ho, threshold=THR_PROD, labels=("Denied","Certified"))
[Split] Train=20,384 Hold-out=5,096
[Policy] precision_floor → production threshold = 0.641

=== HOLD-OUT METRICS (calibrated) ===
roc_auc      0.7630
pr_auc       0.8600
ks           0.3953
brier        0.1807
log_loss     0.5389
accuracy     0.7121
precision    0.7952
recall       0.7665
f1           0.7806
threshold    0.6406
dtype: float64
Confusion @ production threshold:
array([[1020, 672],
[ 795, 2609]])
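As a sanity check, the hold-out precision, recall, and accuracy above can be recomputed by hand from this confusion matrix (rows = actual Denied/Certified, columns = predicted):

```python
import numpy as np

# Hold-out confusion matrix at threshold 0.641 (rows: actual, cols: predicted)
cm = np.array([[1020, 672],    # Denied:    TN, FP
               [795, 2609]])   # Certified: FN, TP
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / cm.sum()
print(round(precision, 4), round(recall, 4), round(accuracy, 4))
# → 0.7952 0.7665 0.7121, matching the reported hold-out metrics
```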
from sklearn.tree import plot_tree, export_text
# Use the fitted RandomForest pipeline (rf_pipe) and its model step
assert "rf_pipe" in globals(), "Fit the RandomForest pipeline (rf_pipe) first"
rf_model = rf_pipe.named_steps["model"]
rf_feats = rf_pipe.named_steps["prep"].get_feature_names_out()
# Visualize one tree (e.g., tree index 0, depth=3)
tree_idx = 0
tree0 = rf_model.estimators_[tree_idx]
plt.figure(figsize=(18, 10))
plot_tree(
tree0,
feature_names=rf_feats,
class_names=["Denied", "Certified"],
filled=True, rounded=True, proportion=True, fontsize=7,
max_depth=3
)
plt.title(f"RandomForest — Tree #{tree_idx} (max_depth=3) @ threshold=0.641 (precision_floor)")
plt.tight_layout()
plt.show()
importances = pd.Series(rf_model.feature_importances_, index=rf_feats)
top_importances = importances.sort_values(ascending=False).head(20)
plt.figure(figsize=(8, 6))
plt.barh(top_importances.index, top_importances.values)
plt.gca().invert_yaxis()
plt.title("RandomForest — Top 20 Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
# Display tree rules (text format, depth=3)
print(f"=== RandomForest — Rules for Tree #{tree_idx} (max_depth=3) @ threshold=0.641 (precision_floor) ===")
print(export_text(tree0, feature_names=list(rf_feats), max_depth=3))
=== RandomForest — Rules for Tree #0 (max_depth=3) @ threshold=0.641 (precision_floor) === |--- _edu_Doctorate <= 0.50 | |--- wage_usd_yr_cap <= 250637.73 | | |--- _region_West <= 0.50 | | | |--- _job_exp_Y <= 0.50 | | | | |--- truncated branch of depth 39 | | | |--- _job_exp_Y > 0.50 | | | | |--- truncated branch of depth 33 | | |--- _region_West > 0.50 | | | |--- _uow_Hour <= 0.50 | | | | |--- truncated branch of depth 34 | | | |--- _uow_Hour > 0.50 | | | | |--- truncated branch of depth 11 | |--- wage_usd_yr_cap > 250637.73 | | |--- _job_exp_Y <= 0.50 | | | |--- wage_usd_yr_cap <= 629553.50 | | | | |--- truncated branch of depth 23 | | | |--- wage_usd_yr_cap > 629553.50 | | | | |--- truncated branch of depth 22 | | |--- _job_exp_Y > 0.50 | | | |--- log_wage_usd_yr <= 14.39 | | | | |--- truncated branch of depth 17 | | | |--- log_wage_usd_yr > 14.39 | | | | |--- truncated branch of depth 14 |--- _edu_Doctorate > 0.50 | |--- _region_South <= 0.50 | | |--- log_wage_usd_yr <= 12.32 | | | |--- _job_exp_Y <= 0.50 | | | | |--- truncated branch of depth 18 | | | |--- _job_exp_Y > 0.50 | | | | |--- truncated branch of depth 16 | | |--- log_wage_usd_yr > 12.32 | | | |--- log_employees <= 7.40 | | | | |--- truncated branch of depth 4 | | | |--- log_employees > 7.40 | | | | |--- truncated branch of depth 10 | |--- _region_South > 0.50 | | |--- employees_clip0_cap <= 9181.50 | | | |--- log_employees <= 8.33 | | | | |--- truncated branch of depth 14 | | | |--- log_employees > 8.33 | | | | |--- class: 1.0 | | |--- employees_clip0_cap > 9181.50 | | | |--- log_employees <= 11.01 | | | | |--- truncated branch of depth 5 | | | |--- log_employees > 11.01 | | | | |--- class: 1.0
# B) Curves: ROC, PR, reliability (calibration) on hold-out
import numpy as np, matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
# ROC
fpr, tpr, _ = roc_curve(y_ho, proba_ho)
rocAUC = auc(fpr, tpr)
plt.figure(figsize=(5.2, 4))
plt.plot(fpr, tpr, label=f"ROC-AUC = {rocAUC:.3f}")
plt.plot([0,1],[0,1],"--",lw=1,alpha=.5)
plt.xlabel("FPR"); plt.ylabel("TPR"); plt.title("Hold-out ROC"); plt.legend(); plt.grid(alpha=.25); plt.tight_layout(); plt.show()
# PR
prec, rec, _ = precision_recall_curve(y_ho, proba_ho)
prAUC = average_precision_score(y_ho, proba_ho)
plt.figure(figsize=(5.2, 4))
plt.plot(rec, prec, label=f"PR-AUC (AP) = {prAUC:.3f}")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.title("Hold-out PR curve")
plt.legend(); plt.grid(alpha=.25); plt.tight_layout(); plt.show()
# Reliability / calibration curve
prob_true, prob_pred = calibration_curve(y_ho, proba_ho, n_bins=10, strategy="quantile")
plt.figure(figsize=(5.2, 4))
plt.plot(prob_pred, prob_true, marker="o", label="Calibrated")
plt.plot([0,1],[0,1],"--",lw=1,alpha=.5)
plt.xlabel("Predicted probability"); plt.ylabel("Observed frequency")
plt.title("Hold-out reliability curve"); plt.legend(); plt.grid(alpha=.25); plt.tight_layout(); plt.show()
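The Brier score tracked alongside these curves is simply the mean squared error of the predicted probabilities, which is why lower is better; a quick check on toy labels/probabilities:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y = np.array([0, 1, 1, 0, 1])
p = np.array([0.1, 0.8, 0.6, 0.4, 0.9])
manual = float(np.mean((p - y) ** 2))  # mean squared error of probabilities
print(round(brier_score_loss(y, p), 3), round(manual, 3))  # both 0.076
```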
# C) Slice performance, artifact saving, model card
import json, joblib, pathlib
def slice_report(df, proba, y_true, col, thr):
    # NOTE: .indices yields positional locations, so callers must pass a frame
    # with a fresh RangeIndex (hence the reset_index at the call site).
    rows = []
    for val, idx in df.groupby(col).indices.items():
        yy = y_true[idx]; pp = proba[idx]
        perf, _ = model_performance_classification_sklearn(yy, pp, threshold=thr)
        perf.update(dict(slice=str(col), value=str(val), n=int(len(idx))))
        rows.append(perf)
    return (pd.DataFrame(rows)
            .sort_values("n", ascending=False)
            .loc[:, ["slice", "value", "n", "roc_auc", "pr_auc", "precision", "recall", "f1", "brier", "log_loss"]]
            .round(4))
print("=== Slice metrics (hold-out) ===")
for c in ["_continent", "_region", "_edu"]:
    if c in DF_HO.columns:
        print(f"\n-- {c} --")
        display(slice_report(DF_HO.reset_index(drop=True), proba_ho, y_ho, c, THR_PROD))
# Save artifacts toggle
SAVE = True
if SAVE:
    pathlib.Path("models").mkdir(exist_ok=True)
    pathlib.Path("reports").mkdir(exist_ok=True)
    joblib.dump(cal, "models/easyvisa_logit_calibrated.pkl")
    with open("models/threshold.json", "w") as f:
        json.dump({"threshold": float(THR_PROD), "rng": int(RNG), "policy": POLICY}, f)
    with open("reports/metrics_holdout.json", "w") as f:
        json.dump({k: float(v) if isinstance(v, (int, float, np.floating)) else v
                   for k, v in perf_ho.items()}, f)
    print("[Saved] models/easyvisa_logit_calibrated.pkl, models/threshold.json, reports/metrics_holdout.json")
# Minimal model card
MODEL_CARD = {
"Model": BEST_MODEL_NAME + " + sigmoid calibration",
"Purpose": "Predict likelihood of visa certification.",
"Data": {"rows": int(len(DF)), "features_numeric": ["log_wage_usd_yr","log_employees","log_company_age"],
"categoricals": ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"]},
"Selection_metric": "ROC-AUC (primary), PR-AUC (secondary) via 5-fold OOF on train",
"Threshold_policy": {"policy": POLICY, "precision_floor": PRECISION_FLOOR, "recall_floor": RECALL_FLOOR,
"threshold_production": float(THR_PROD)},
"Holdout_metrics": {k: float(v) if isinstance(v,(int,float,np.floating)) else v for k,v in perf_ho.items()},
"Calibration": "Platt (sigmoid), 5-fold on TRAIN; evaluated on HOLD-OUT",
"Limitations": ["Historical bias possible across regions/education; monitor slice metrics monthly.",
"Probability drift expected if wage/employee distributions shift; monitor PSI."],
"Retrain_triggers": ["PSI > 0.2 on any key feature", "Hold-out ROC-AUC drop > 0.03 vs baseline",
"Policy threshold no longer meets precision/recall floor on recent data"]
}
print("\n=== MODEL CARD (summary) ===")
display(pd.Series(MODEL_CARD))
=== Slice metrics (hold-out) ===

-- _continent --
| | slice | value | n | roc_auc | pr_auc | precision | recall | f1 | brier | log_loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | _continent | Asia | 3325 | 0.7622 | 0.8539 | 0.7837 | 0.7576 | 0.7704 | 0.1839 | 0.5460 |
| 2 | _continent | Europe | 739 | 0.7628 | 0.9145 | 0.8727 | 0.8861 | 0.8793 | 0.1344 | 0.4280 |
| 3 | _continent | North America | 703 | 0.7102 | 0.7888 | 0.7321 | 0.6721 | 0.7009 | 0.2118 | 0.6149 |
| 5 | _continent | South America | 182 | 0.7055 | 0.8005 | 0.7684 | 0.6348 | 0.6952 | 0.2119 | 0.6183 |
| 0 | _continent | Africa | 103 | 0.8327 | 0.9395 | 0.8767 | 0.8312 | 0.8533 | 0.1426 | 0.4431 |
| 4 | _continent | Oceania | 44 | 0.6873 | 0.8269 | 0.7667 | 0.7419 | 0.7541 | 0.1829 | 0.5448 |
-- _region --
| | slice | value | n | roc_auc | pr_auc | precision | recall | f1 | brier | log_loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | _region | Northeast | 1413 | 0.7902 | 0.8500 | 0.8002 | 0.7744 | 0.7871 | 0.1757 | 0.5318 |
| 3 | _region | South | 1360 | 0.7441 | 0.8611 | 0.7916 | 0.7711 | 0.7812 | 0.1861 | 0.5494 |
| 4 | _region | West | 1338 | 0.7353 | 0.8183 | 0.7578 | 0.6945 | 0.7248 | 0.1980 | 0.5802 |
| 1 | _region | Midwest | 906 | 0.7596 | 0.9088 | 0.8381 | 0.8538 | 0.8459 | 0.1532 | 0.4697 |
| 0 | _region | Island | 79 | 0.7129 | 0.8310 | 0.7222 | 0.5306 | 0.6118 | 0.1999 | 0.5809 |
-- _edu --
| | slice | value | n | roc_auc | pr_auc | precision | recall | f1 | brier | log_loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _edu | Bachelor's | 2044 | 0.6876 | 0.7490 | 0.7292 | 0.6743 | 0.7007 | 0.2098 | 0.6099 |
| 3 | _edu | Master's | 1955 | 0.7412 | 0.9051 | 0.8195 | 0.9154 | 0.8648 | 0.1471 | 0.4578 |
| 2 | _edu | High School | 702 | 0.5462 | 0.3993 | 0.5000 | 0.0315 | 0.0593 | 0.2399 | 0.6821 |
| 1 | _edu | Doctorate | 395 | 0.7252 | 0.9473 | 0.9003 | 0.9772 | 0.9372 | 0.0916 | 0.3183 |
[Saved] models/easyvisa_logit_calibrated.pkl, models/threshold.json, reports/metrics_holdout.json

=== MODEL CARD (summary) ===
Model Logit (logs+OHE+cw) + sigmoid calibration
Purpose Predict likelihood of visa certification.
Data {'rows': 25480, 'features_numeric': ['log_wage...
Selection_metric ROC-AUC (primary), PR-AUC (secondary) via 5-fo...
Threshold_policy {'policy': 'precision_floor', 'precision_floor...
Holdout_metrics {'roc_auc': 0.7630193792312201, 'pr_auc': 0.86...
Calibration Platt (sigmoid), 5-fold on TRAIN; evaluated on...
Limitations [Historical bias possible across regions/educa...
Retrain_triggers [PSI > 0.2 on any key feature, Hold-out ROC-AU...
dtype: object
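The retrain triggers above reference PSI (Population Stability Index) without defining it; a minimal sketch of the usual decile-based computation (the function name is hypothetical and not part of the notebook's helpers):

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI of one feature: baseline sample vs. recent sample.
    Bins are deciles of the baseline; eps guards against log(0)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # widen the outer edges so out-of-range recent values still land in a bin
    edges[0] = min(expected.min(), actual.min()) - 1e-9
    edges[-1] = max(expected.max(), actual.max()) + 1e-9
    eps = 1e-6
    e = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
base = rng.normal(size=5000)
print(population_stability_index(base, base))         # ~0: no drift
print(population_stability_index(base, base + 1.0))   # > 0.2: retrain trigger fires
```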
Defining scorer to be used for cross-validation and hyperparameter tuning¶
We score models with ROC-AUC as the primary metric and PR-AUC (Average Precision) as the secondary. Both are threshold-free and work directly on predicted probabilities, which fits a pipeline that picks the operating threshold later by policy (precision ≥ 0.80) after calibration.
ROC-AUC measures ranking quality across all thresholds and stays stable under the ~2:1 class skew, making it the cleanest way to compare models during CV.
PR-AUC focuses on the Certified class's precision/recall trade-off. When positive-class utility matters, PR-AUC tells us how efficiently a model harvests true positives without flooding in false positives.
Accuracy can look fine under imbalance while missing the point, and F1 depends on a specific threshold, so threshold decisions are pushed to a separate policy step (Youden or a precision floor) rather than into hyperparameter tuning.
We refit on ROC-AUC but also track PR-AUC, Brier, and log loss to see both ranking and probability quality. Calibration is applied after hyperparameters are picked to avoid leakage, and the operating threshold is chosen on the calibrated TRAIN probabilities by policy (precision ≥ 0.80), then evaluated on the HOLD-OUT.
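That multi-metric setup can be expressed directly in scikit-learn: pass a scoring dict and refit on ROC-AUC while the secondary metrics are still recorded in `cv_results_` (toy data and grid below are illustrative, not the notebook's actual search):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# ~2:1 skewed toy problem, mirroring the Certified/Denied imbalance
X, y = make_classification(n_samples=600, weights=[0.33, 0.67], random_state=0)

scoring = {
    "roc_auc": "roc_auc",
    "pr_auc": "average_precision",   # PR-AUC (Average Precision)
    "neg_brier": "neg_brier_score",
    "neg_log_loss": "neg_log_loss",
}
gs = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    scoring=scoring,
    refit="roc_auc",   # model selection on ROC-AUC; the rest tracked for diagnostics
    cv=5,
).fit(X, y)
print(sorted(k for k in gs.cv_results_ if k.startswith("mean_test_")))
```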
We are now done with pre-processing and evaluation criterion, so let's start building the model.
Model building with Original data¶
# Model Building — ORIGINAL data
# DT, RF, Bagging, AdaBoost, HistGradientBoosting
# Calibrate on TRAIN → pick threshold (precision ≥0.80 on TRAIN) → Evaluate on HOLD-OUT
import numpy as np, pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import precision_recall_curve, roc_curve, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, HistGradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
# --- Guards
assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals(), "Run split first."
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()
y_tr = DF_TR["y"].astype(int).to_numpy()
y_ho = DF_HO["y"].astype(int).to_numpy()
# --- Precision ≥ floor on TRAIN (fallback Youden)
def _pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
    y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    ok = np.where(prec[:-1] >= p_floor)[0]  # precision[i] pairs with thresholds[i]
    if ok.size:
        cands = thr[ok]
        f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
        return float(cands[int(np.argmax(f1s))])
    fpr, tpr, thr = roc_curve(y_true, y_prob)
    return float(thr[int(np.argmax(tpr - fpr))])
# --- Tree-style prep: passthrough numeric + OHE categorical (same as RF you used)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR.columns]
pp_trees = ColumnTransformer(
[("num", "passthrough", NUM),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
remainder="drop", verbose_feature_names_out=False
)
# --- 5 baseline models (class_weight on ORIGINAL to respect skew)
DT = Pipeline([("prep", pp_trees),
("model", DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=RNG))])
RF = Pipeline([("prep", pp_trees),
("model", RandomForestClassifier(n_estimators=600, random_state=RNG,
n_jobs=-1, class_weight="balanced_subsample"))])
BAG = Pipeline([("prep", pp_trees),
("model", BaggingClassifier(
estimator=DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=RNG),
n_estimators=200, n_jobs=-1, random_state=RNG
))])
ADA = Pipeline([("prep", pp_trees),
("model", AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=1, class_weight="balanced", random_state=RNG),
n_estimators=100, learning_rate=0.5, random_state=RNG
))])
HGB = Pipeline([("prep", pp_trees),
("model", HistGradientBoostingClassifier(
max_depth=None, learning_rate=0.06, max_iter=600,
l2_regularization=0.1, class_weight="balanced", random_state=RNG
))])
MODEL_ZOO = {
"DecisionTree": DT,
"RandomForest": RF,
"Bagging": BAG,
"AdaBoost": ADA,
"HistGB": HGB,
}
# --- Train→Calibrate→Threshold (TRAIN) → Evaluate (HOLD-OUT)
rows, probs, thrs = [], {}, {}
for name, pipe in MODEL_ZOO.items():
    cal = CalibratedClassifierCV(pipe, method="sigmoid", cv=5).fit(DF_TR, y_tr)
    thr = _pick_thr_precision_floor(y_tr, cal.predict_proba(DF_TR)[:, 1], p_floor=0.80)
    proba_ho = cal.predict_proba(DF_HO)[:, 1]
    perf, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=thr)
    perf["model"] = name
    rows.append(perf); probs[name] = proba_ho; thrs[name] = thr
ORIG_5MODELS = (pd.DataFrame(rows)
.loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss","accuracy","precision","recall","f1","threshold"]]
.sort_values(["roc_auc","pr_auc"], ascending=False)
.reset_index(drop=True).round(4))
print("=== ORIGINAL data — 5 classification models (calibrated, precision≥0.80 policy) ===")
display(ORIG_5MODELS)
# --- Confusion matrices for the top 2 by ROC-AUC
top2 = ORIG_5MODELS.head(2)["model"].tolist()
for m in top2:
    print(f"\nConfusion — {m} @ thr={thrs[m]:.3f}")
    confusion_matrix_sklearn(y_ho, probs[m], threshold=thrs[m], labels=("Denied","Certified"))
=== ORIGINAL data — 5 classification models (calibrated, precision≥0.80 policy) ===
| | model | roc_auc | pr_auc | ks | brier | log_loss | accuracy | precision | recall | f1 | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HistGB | 0.7790 | 0.8749 | 0.4119 | 0.1739 | 0.5169 | 0.7408 | 0.7904 | 0.8328 | 0.8110 | 0.5690 |
| 1 | Bagging | 0.7705 | 0.8678 | 0.4041 | 0.1780 | 0.5306 | 0.7325 | 0.7910 | 0.8149 | 0.8028 | 0.5068 |
| 2 | DecisionTree | 0.7661 | 0.8560 | 0.3974 | 0.1790 | 0.5335 | 0.7400 | 0.7725 | 0.8657 | 0.8165 | 0.4881 |
| 3 | RandomForest | 0.7622 | 0.8657 | 0.3917 | 0.1810 | 0.5388 | 0.7241 | 0.7707 | 0.8355 | 0.8018 | 0.5505 |
| 4 | AdaBoost | 0.6782 | 0.7654 | 0.3015 | 0.1993 | 0.5856 | 0.7060 | 0.7169 | 0.9254 | 0.8079 | 0.6132 |
Confusion — HistGB @ thr=0.569
Confusion — Bagging @ thr=0.507
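The `ks` column in these tables is the Kolmogorov–Smirnov separation between the score distributions of the two classes, i.e. the maximum of TPR − FPR over all thresholds (assuming `model_performance_classification_sklearn` computes it the standard way); a toy check:

```python
import numpy as np
from sklearn.metrics import roc_curve

y = np.array([0, 0, 0, 1, 1, 1, 1])
p = np.array([0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9])
fpr, tpr, _ = roc_curve(y, p)
ks = float(np.max(tpr - fpr))  # largest vertical gap between the class CDFs
print(ks)  # 0.75 here: three of four positives score above every negative
```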
Selection rule: by ROC-AUC (primary), HistGB wins at 0.7790, ahead of Bagging (0.7705), Logit (0.7630), and RF (0.7622).
At the cutoff: HistGB has the best overall F1/recall balance; DecisionTree pushes recall highest but trades off precision. These are threshold effects, not ranking differences.
Precision floor check: thresholds were set on TRAIN (≥ 0.80). On HOLD-OUT, HistGB lands at ≈ 0.790 precision and Bagging at ≈ 0.791; noted, with no changes made.
Calibration: the reliability curve looks solid; the probabilities are usable.
Bottom line: keep HistGB as the baseline reference for the original-data setting, and carry this same setup into over/under-sampling and tuning.
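The precision-floor rule itself can be sanity-checked on toy scores. With `precision_recall_curve`, `precision[i]` pairs with `thresholds[i]`, so any threshold selected under that alignment is guaranteed to meet the floor:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
p = np.array([0.1, 0.5, 0.7, 0.4, 0.6, 0.8, 0.85, 0.9])

prec, rec, thr = precision_recall_curve(y, p)
ok = np.where(prec[:-1] >= 0.80)[0]   # prec[:-1] aligns with thr
t = float(thr[ok].min())              # loosest cutoff that still meets the floor
print(t)  # 0.6: predicting Certified at p >= 0.6 yields precision exactly 0.80
```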
# Visualize model: tree structure, rules, and feature importances for ORIGINAL data models
# DecisionTree visualization (first tree in Bagging/RandomForest)
for name, pipe in MODEL_ZOO.items():
    # Fit the pipeline on training data before accessing feature names
    pipe.fit(DF_TR, y_tr)
    model = pipe.named_steps["model"]
    prep = pipe.named_steps["prep"]
    feats = prep.get_feature_names_out()
    print(f"\n=== {name} ===")
    # For ensemble models, visualize the first tree if possible
    tree = None
    if hasattr(model, "estimators_"):
        tree = model.estimators_[0]
    elif isinstance(model, DecisionTreeClassifier):
        tree = model
    # Only plot if the (sub-)estimator is a DecisionTreeClassifier
    if tree is not None and isinstance(tree, DecisionTreeClassifier):
        plt.figure(figsize=(18, 10))
        plot_tree(
            tree,
            feature_names=feats,
            class_names=["Denied", "Certified"],
            filled=True, rounded=True, proportion=True, fontsize=7,
            max_depth=3
        )
        plt.title(f"{name} — Tree #0 (max_depth=3)")
        plt.tight_layout()
        plt.show()
        # Print tree rules (text format, depth=3)
        print(f"=== {name} — Rules for Tree #0 (max_depth=3) ===")
        print(export_text(tree, feature_names=list(feats), max_depth=3))
    else:
        print(f"[SKIP] Tree visualization not supported for {type(model).__name__}")
    # Feature importances (top 20)
    if hasattr(model, "feature_importances_"):
        importances = pd.Series(model.feature_importances_, index=feats)
        top_imp = importances.sort_values(ascending=False).head(20)
        plt.figure(figsize=(8, 6))
        plt.barh(top_imp.index, top_imp.values)
        plt.gca().invert_yaxis()
        plt.title(f"{name} — Top 20 Feature Importances")
        plt.xlabel("Importance")
        plt.tight_layout()
        plt.show()
=== DecisionTree ===
=== DecisionTree — Rules for Tree #0 (max_depth=3) === |--- _edu_High School <= 0.50 | |--- _job_exp_N <= 0.50 | | |--- _edu_Bachelor's <= 0.50 | | | |--- _uow_Year <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _uow_Year > 0.50 | | | | |--- truncated branch of depth 3 | | |--- _edu_Bachelor's > 0.50 | | | |--- _uow_Hour <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _uow_Hour > 0.50 | | | | |--- truncated branch of depth 3 | |--- _job_exp_N > 0.50 | | |--- _continent_Europe <= 0.50 | | | |--- _uow_Year <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _uow_Year > 0.50 | | | | |--- truncated branch of depth 3 | | |--- _continent_Europe > 0.50 | | | |--- _uow_Hour <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _uow_Hour > 0.50 | | | | |--- truncated branch of depth 3 |--- _edu_High School > 0.50 | |--- _continent_North America <= 0.50 | | |--- _continent_South America <= 0.50 | | | |--- _region_Midwest <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _region_Midwest > 0.50 | | | | |--- truncated branch of depth 3 | | |--- _continent_South America > 0.50 | | | |--- company_age <= 6.50 | | | | |--- truncated branch of depth 2 | | | |--- company_age > 6.50 | | | | |--- truncated branch of depth 3 | |--- _continent_North America > 0.50 | | |--- _job_exp_N <= 0.50 | | | |--- wage_usd_yr_cap <= 23696.56 | | | | |--- truncated branch of depth 3 | | | |--- wage_usd_yr_cap > 23696.56 | | | | |--- truncated branch of depth 3 | | |--- _job_exp_N > 0.50 | | | |--- log_company_age <= 3.81 | | | | |--- truncated branch of depth 3 | | | |--- log_company_age > 3.81 | | | | |--- truncated branch of depth 3
=== RandomForest ===
=== RandomForest — Rules for Tree #0 (max_depth=3) === |--- _edu_Doctorate <= 0.50 | |--- wage_usd_yr_cap <= 284351.00 | | |--- _region_West <= 0.50 | | | |--- _job_exp_Y <= 0.50 | | | | |--- truncated branch of depth 33 | | | |--- _job_exp_Y > 0.50 | | | | |--- truncated branch of depth 38 | | |--- _region_West > 0.50 | | | |--- _edu_High School <= 0.50 | | | | |--- truncated branch of depth 27 | | | |--- _edu_High School > 0.50 | | | | |--- truncated branch of depth 18 | |--- wage_usd_yr_cap > 284351.00 | | |--- log_wage_usd_yr <= 14.37 | | | |--- _job_exp_N <= 0.50 | | | | |--- truncated branch of depth 16 | | | |--- _job_exp_N > 0.50 | | | | |--- truncated branch of depth 29 | | |--- log_wage_usd_yr > 14.37 | | | |--- employees_clip0_cap <= 3387.50 | | | | |--- truncated branch of depth 14 | | | |--- employees_clip0_cap > 3387.50 | | | | |--- truncated branch of depth 9 |--- _edu_Doctorate > 0.50 | |--- _uow_Hour <= 0.50 | | |--- _job_train_Y <= 0.50 | | | |--- _region_South <= 0.50 | | | | |--- truncated branch of depth 16 | | | |--- _region_South > 0.50 | | | | |--- truncated branch of depth 12 | | |--- _job_train_Y > 0.50 | | | |--- wage_usd_yr_cap <= 83249.25 | | | | |--- truncated branch of depth 11 | | | |--- wage_usd_yr_cap > 83249.25 | | | | |--- truncated branch of depth 17 | |--- _uow_Hour > 0.50 | | |--- _region_Midwest <= 0.50 | | | |--- log_company_age <= 2.35 | | | | |--- class: 1.0 | | | |--- log_company_age > 2.35 | | | | |--- truncated branch of depth 7 | | |--- _region_Midwest > 0.50 | | | |--- _continent_Asia <= 0.50 | | | | |--- class: 1.0 | | | |--- _continent_Asia > 0.50 | | | | |--- class: 1.0
=== Bagging ===
=== Bagging — Rules for Tree #0 (max_depth=3) === |--- _edu_High School <= 0.50 | |--- _job_exp_N <= 0.50 | | |--- _edu_Bachelor's <= 0.50 | | | |--- _uow_Year <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _uow_Year > 0.50 | | | | |--- truncated branch of depth 3 | | |--- _edu_Bachelor's > 0.50 | | | |--- _uow_Hour <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _uow_Hour > 0.50 | | | | |--- truncated branch of depth 3 | |--- _job_exp_N > 0.50 | | |--- _continent_Europe <= 0.50 | | | |--- _uow_Year <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _uow_Year > 0.50 | | | | |--- truncated branch of depth 3 | | |--- _continent_Europe > 0.50 | | | |--- log_wage_usd_yr <= 12.15 | | | | |--- truncated branch of depth 3 | | | |--- log_wage_usd_yr > 12.15 | | | | |--- truncated branch of depth 3 |--- _edu_High School > 0.50 | |--- _continent_North America <= 0.50 | | |--- _region_Midwest <= 0.50 | | | |--- _continent_South America <= 0.50 | | | | |--- truncated branch of depth 3 | | | |--- _continent_South America > 0.50 | | | | |--- truncated branch of depth 3 | | |--- _region_Midwest > 0.50 | | | |--- company_age <= 8.50 | | | | |--- truncated branch of depth 3 | | | |--- company_age > 8.50 | | | | |--- truncated branch of depth 3 | |--- _continent_North America > 0.50 | | |--- _job_exp_Y <= 0.50 | | | |--- log_company_age <= 4.03 | | | | |--- truncated branch of depth 3 | | | |--- log_company_age > 4.03 | | | | |--- truncated branch of depth 3 | | |--- _job_exp_Y > 0.50 | | | |--- wage_usd_yr_cap <= 27212.02 | | | | |--- truncated branch of depth 3 | | | |--- wage_usd_yr_cap > 27212.02 | | | | |--- truncated branch of depth 3 === AdaBoost ===
=== AdaBoost — Rules for Tree #0 (max_depth=3) === |--- _edu_High School <= 0.50 | |--- class: 1 |--- _edu_High School > 0.50 | |--- class: 0
=== HistGB ===
[SKIP] Tree visualization not supported for HistGradientBoostingClassifier
Model Building with Oversampled data¶
The oversampled run mirrors the original-data section, but with an oversampling step applied to TRAIN only (never the hold-out), so there is no leakage. The minority class is upsampled with replacement until the classes balance (the same effect as imblearn's RandomOverSampler); we then calibrate, pick the policy threshold (precision ≥ 0.80), and evaluate on the HOLD-OUT.
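In isolation, the TRAIN-only balancing step works like this (helper name hypothetical; the notebook's `_make_os` below follows the same pattern):

```python
import pandas as pd

def oversample_minority(df, label="y", seed=42):
    # upsample the minority class with replacement until both classes match
    pos, neg = df[df[label] == 1], df[df[label] == 0]
    minority, majority = (neg, pos) if len(pos) >= len(neg) else (pos, neg)
    boosted = minority.sample(n=len(majority), replace=True, random_state=seed)
    out = pd.concat([majority, boosted], axis=0)
    return out.sample(frac=1.0, random_state=seed).reset_index(drop=True)  # shuffle

toy = pd.DataFrame({"y": [1] * 8 + [0] * 2, "x": range(10)})
bal = oversample_minority(toy)
print(bal["y"].value_counts().to_dict())  # both classes now have 8 rows
```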
# Model Building — OVERSAMPLED data (TRAIN only)
# DT, RF, Bagging, AdaBoost, HistGradientBoosting
# Calibrate on OVERSAMPLED TRAIN → pick threshold (precision ≥0.80 on TRAIN) → Evaluate on HOLD-OUT
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, HistGradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals(), "Run the original split first."
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()
y_ho = DF_HO["y"].astype(int).to_numpy()
# Helper: precision ≥ floor on TRAIN (fallback Youden)
def _pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
    y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    ok = np.where(prec[:-1] >= p_floor)[0]  # precision[i] pairs with thresholds[i]
    if ok.size:
        cands = thr[ok]
        f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
        return float(cands[int(np.argmax(f1s))])
    fpr, tpr, thr = roc_curve(y_true, y_prob)
    return float(thr[int(np.argmax(tpr - fpr))])
# 1) Create OVERSAMPLED TRAIN (balance classes)
def _make_os(df):
    pos = df[df["y"] == 1]; neg = df[df["y"] == 0]
    minority, majority = (neg, pos) if len(pos) >= len(neg) else (pos, neg)
    mi = minority.sample(n=len(majority), replace=True, random_state=RNG)
    return pd.concat([majority, mi], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)
DF_TR_OS = globals().get("DF_TR_OS", _make_os(DF_TR))
y_tr_os = DF_TR_OS["y"].astype(int).to_numpy()
print(f"[Oversample] TRAIN original={len(DF_TR):,} → oversampled={len(DF_TR_OS):,}")
# 2) Preprocessors (tree-style: passthrough NUM + OHE CAT)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR_OS.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR_OS.columns]
pp_trees = ColumnTransformer(
[("num", "passthrough", NUM),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
remainder="drop", verbose_feature_names_out=False
)
# 3) Five models (class_weight=None to isolate oversampling effect)
DT = Pipeline([("prep", pp_trees),
("model", DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG))])
RF = Pipeline([("prep", pp_trees),
("model", RandomForestClassifier(n_estimators=600, random_state=RNG, n_jobs=-1, class_weight=None))])
BAG = Pipeline([("prep", pp_trees),
("model", BaggingClassifier(
estimator=DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG),
n_estimators=200, n_jobs=-1, random_state=RNG
))])
ADA = Pipeline([("prep", pp_trees),
("model", AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=1, class_weight=None, random_state=RNG),
n_estimators=100, learning_rate=0.5, random_state=RNG
))])
HGB = Pipeline([("prep", pp_trees),
("model", HistGradientBoostingClassifier(
max_depth=None, learning_rate=0.06, max_iter=600,
l2_regularization=0.1, class_weight=None, random_state=RNG
))])
MODEL_ZOO_OS = {"DecisionTree": DT, "RandomForest": RF, "Bagging": BAG, "AdaBoost": ADA, "HistGB": HGB}
# 4) Train → Calibrate → Threshold on OVERSAMPLED TRAIN → Evaluate on HOLD-OUT
rows, probs, thrs = [], {}, {}
for name, pipe in MODEL_ZOO_OS.items():
cal = CalibratedClassifierCV(pipe, method="sigmoid", cv=5).fit(DF_TR_OS, y_tr_os)
thr = _pick_thr_precision_floor(y_tr_os, cal.predict_proba(DF_TR_OS)[:,1], p_floor=0.80)
proba_ho = cal.predict_proba(DF_HO)[:,1]
perf, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=thr)
perf["model"] = name
rows.append(perf); probs[name] = proba_ho; thrs[name] = thr
OS_5MODELS = (pd.DataFrame(rows)
.loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss","accuracy","precision","recall","f1","threshold"]]
.sort_values(["roc_auc","pr_auc"], ascending=False)
.reset_index(drop=True).round(4))
print("=== OVERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===")
display(OS_5MODELS)
# Confusion matrices for the top 2 by ROC-AUC
_top2 = OS_5MODELS.head(2)["model"].tolist()
for m in _top2:
print(f"\nConfusion — {m} (OS) @ thr={thrs[m]:.3f}")
confusion_matrix_sklearn(y_ho, probs[m], threshold=thrs[m], labels=("Denied","Certified"))
[Oversample] TRAIN original=20,384 → oversampled=27,228
=== OVERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===
|   | model | roc_auc | pr_auc | ks | brier | log_loss | accuracy | precision | recall | f1 | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bagging | 0.7694 | 0.8667 | 0.4021 | 0.1969 | 0.5772 | 0.6046 | 0.8810 | 0.4718 | 0.6145 | 0.6521 |
| 1 | HistGB | 0.7681 | 0.8664 | 0.4001 | 0.1890 | 0.5580 | 0.7219 | 0.8050 | 0.7703 | 0.7873 | 0.5000 |
| 2 | DecisionTree | 0.7650 | 0.8543 | 0.3938 | 0.1981 | 0.5812 | 0.6050 | 0.8811 | 0.4724 | 0.6150 | 0.6439 |
| 3 | AdaBoost | 0.7618 | 0.8595 | 0.3889 | 0.1990 | 0.5870 | 0.5861 | 0.8825 | 0.4389 | 0.5862 | 0.6811 |
| 4 | RandomForest | 0.7563 | 0.8623 | 0.3778 | 0.1965 | 0.6018 | 0.7111 | 0.7856 | 0.7806 | 0.7831 | 0.6739 |
Confusion — Bagging (OS) @ thr=0.652
Confusion — HistGB (OS) @ thr=0.500
Ranking rule: Bagging 0.7694 ROC-AUC vs HistGB 0.7681; Bagging still edges it, but neither beats the original runs (HistGB 0.7790, Logit 0.7630, RF 0.7622). PR-AUC is essentially flat.
At the policy cutoff (precision ≥ 0.80 set on TRAIN):
- Bagging (oversampled): precision 0.8810, recall 0.4718; recall collapses.
- HistGB (oversampled): precision 0.8050, recall 0.7703; more balanced, but slightly worse than the original RF.
Bottom line: oversampling added variance and no upside over the original class-weight setup.
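For reference, the precision-floor rule used to pick these cutoffs can be illustrated end to end on synthetic scores (a self-contained sketch mirroring `_pick_thr_precision_floor`; the data here is random, not the visa data):

```python
# Self-contained illustration of the precision-floor thresholding rule:
# among cutoffs whose precision meets the floor, keep the best-F1 one;
# fall back to Youden's J on the ROC curve if the floor is unreachable.
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve, roc_curve

def pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob).astype(float)
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    ok = np.where(prec[1:] >= p_floor)[0]          # prec[i+1] pairs with thr[i]
    if ok.size:
        cands = thr[ok]
        f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)
               for t in cands]
        return float(cands[int(np.argmax(f1s))])
    fpr, tpr, thr = roc_curve(y_true, y_prob)      # fallback: Youden's J
    return float(thr[int(np.argmax(tpr - fpr))])

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
p = np.clip(0.35 * y + rng.uniform(0.0, 0.7, 500), 0.0, 1.0)  # label-correlated scores
t = pick_thr_precision_floor(y, p, p_floor=0.80)
preds = (p >= t).astype(int)
```

Because `precision_recall_curve` evaluates predictions as `y_prob >= threshold`, any returned cutoff from the precision-ok branch is guaranteed to meet the floor on the data it was fit on.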
# Visualize Bagging model: tree structure, rules, and feature importances
# Refit the Bagging pipeline on the oversampled TRAIN used in this section
# (CalibratedClassifierCV fits clones, so BAG itself is still unfitted)
BAG.fit(DF_TR_OS, y_tr_os)
bag_model = BAG.named_steps["model"]
bag_prep = BAG.named_steps["prep"]
bag_feats = bag_prep.get_feature_names_out()
# Visualize the first tree in the Bagging ensemble
tree_idx = 0
tree0 = bag_model.estimators_[tree_idx]
plt.figure(figsize=(18, 10))
plot_tree(
tree0,
feature_names=bag_feats,
class_names=["Denied", "Certified"],
filled=True, rounded=True, proportion=True, fontsize=7,
max_depth=3
)
plt.title(f"Bagging — Tree #{tree_idx} (max_depth=3)")
plt.tight_layout()
plt.show()
# Print tree rules (text format, depth=3)
print(f"=== Bagging — Rules for Tree #{tree_idx} (max_depth=3) ===")
print(export_text(tree0, feature_names=list(bag_feats), max_depth=3))
# Feature importances (top 20)
if hasattr(tree0, "feature_importances_"):
importances = pd.Series(tree0.feature_importances_, index=bag_feats)
top_imp = importances.sort_values(ascending=False).head(20)
plt.figure(figsize=(8, 6))
plt.barh(top_imp.index, top_imp.values)
plt.gca().invert_yaxis()
plt.title("Bagging — Top 20 Feature Importances (Tree #0)")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
=== Bagging — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_Y <= 0.50
|   |   |--- _uow_Year <= 0.50
|   |   |   |--- _edu_Doctorate <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _edu_Doctorate > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _uow_Year > 0.50
|   |   |   |--- _continent_Europe <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Europe > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_Y > 0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's > 0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Hour > 0.50
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School > 0.50
|   |--- _continent_North America <= 0.50
|   |   |--- _region_Midwest <= 0.50
|   |   |   |--- _continent_Asia <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Asia > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _region_Midwest > 0.50
|   |   |   |--- log_company_age <= 4.87
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_company_age > 4.87
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_North America > 0.50
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- company_age <= 55.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- company_age > 55.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _job_exp_Y > 0.50
|   |   |   |--- log_wage_usd_yr <= 10.21
|   |   |   |   |--- truncated branch of depth 2
|   |   |   |--- log_wage_usd_yr > 10.21
|   |   |   |   |--- truncated branch of depth 3
Model Building with Undersampled data¶
# Model Building — UNDERSAMPLED data (TRAIN only)
# DT, RF, Bagging, AdaBoost, HistGradientBoosting
# Calibrate on UNDERSAMPLED TRAIN → pick threshold (precision ≥0.80 on TRAIN) → Evaluate on HOLD-OUT
assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals(), "Run the original split first."
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()
y_ho = DF_HO["y"].astype(int).to_numpy()
# ---------- Helper: precision ≥ floor on TRAIN (fallback Youden)
def _pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
prec, rec, thr = precision_recall_curve(y_true, y_prob)
ok = np.where(prec[1:] >= p_floor)[0]
if ok.size:
cands = thr[ok]
f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
return float(cands[int(np.argmax(f1s))])
fpr, tpr, thr = roc_curve(y_true, y_prob)
return float(thr[int(np.argmax(tpr - fpr))])
# ---------- 1) Create UNDERSAMPLED TRAIN (balance by downsampling majority)
def _make_us(df):
pos = df[df["y"]==1]; neg = df[df["y"]==0]
majority, minority = (pos, neg) if len(pos)>=len(neg) else (neg, pos)
maj = majority.sample(n=len(minority), replace=False, random_state=RNG)
return pd.concat([maj, minority], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)
DF_TR_US = globals().get("DF_TR_US", _make_us(DF_TR))
y_tr_us = DF_TR_US["y"].astype(int).to_numpy()
print(f"[Undersample] TRAIN original={len(DF_TR):,} → undersampled={len(DF_TR_US):,}")
# ---------- 2) Preprocessors (tree-style: passthrough NUM + OHE CAT)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR_US.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR_US.columns]
pp_trees = ColumnTransformer(
[("num", "passthrough", NUM),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
remainder="drop", verbose_feature_names_out=False
)
# ---------- 3) Five models (class_weight=None to isolate undersampling effect)
DT = Pipeline([("prep", pp_trees),
("model", DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG))])
RF = Pipeline([("prep", pp_trees),
("model", RandomForestClassifier(n_estimators=600, random_state=RNG, n_jobs=-1, class_weight=None))])
BAG = Pipeline([("prep", pp_trees),
("model", BaggingClassifier(
estimator=DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG),
n_estimators=200, n_jobs=-1, random_state=RNG
))])
ADA = Pipeline([("prep", pp_trees),
("model", AdaBoostClassifier(
DecisionTreeClassifier(max_depth=1, class_weight=None, random_state=RNG),
n_estimators=100, learning_rate=0.5, random_state=RNG
))])
HGB = Pipeline([("prep", pp_trees),
("model", HistGradientBoostingClassifier(
max_depth=None, learning_rate=0.06, max_iter=600,
l2_regularization=0.1, class_weight=None, random_state=RNG
))])
MODEL_ZOO_US = {"DecisionTree": DT, "RandomForest": RF, "Bagging": BAG, "AdaBoost": ADA, "HistGB": HGB}
# ---------- 4) Train → Calibrate → Threshold on UNDERSAMPLED TRAIN → Evaluate on HOLD-OUT
rows, probs, thrs = [], {}, {}
for name, pipe in MODEL_ZOO_US.items():
cal = CalibratedClassifierCV(pipe, method="sigmoid", cv=5).fit(DF_TR_US, y_tr_us)
thr = _pick_thr_precision_floor(y_tr_us, cal.predict_proba(DF_TR_US)[:,1], p_floor=0.80)
proba_ho = cal.predict_proba(DF_HO)[:,1]
perf, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=thr)
perf["model"] = name
rows.append(perf); probs[name] = proba_ho; thrs[name] = thr
US_5MODELS = (pd.DataFrame(rows)
.loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss","accuracy","precision","recall","f1","threshold"]]
.sort_values(["roc_auc","pr_auc"], ascending=False)
.reset_index(drop=True).round(4))
print("=== UNDERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===")
display(US_5MODELS)
# Confusion matrices for the top 2 by ROC-AUC
_top2 = US_5MODELS.head(2)["model"].tolist()
for m in _top2:
print(f"\nConfusion — {m} (US) @ thr={thrs[m]:.3f}")
confusion_matrix_sklearn(y_ho, probs[m], threshold=thrs[m], labels=("Denied","Certified"))
[Undersample] TRAIN original=20,384 → undersampled=13,540
=== UNDERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===
|   | model | roc_auc | pr_auc | ks | brier | log_loss | accuracy | precision | recall | f1 | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HistGB | 0.7765 | 0.8743 | 0.4162 | 0.1940 | 0.5658 | 0.6391 | 0.8631 | 0.5464 | 0.6692 | 0.5850 |
| 1 | Bagging | 0.7705 | 0.8695 | 0.4044 | 0.1972 | 0.5779 | 0.6058 | 0.8826 | 0.4727 | 0.6156 | 0.6342 |
| 2 | DecisionTree | 0.7650 | 0.8596 | 0.3877 | 0.1981 | 0.5803 | 0.6044 | 0.8805 | 0.4718 | 0.6144 | 0.6437 |
| 3 | AdaBoost | 0.7620 | 0.8598 | 0.3908 | 0.1985 | 0.5861 | 0.5842 | 0.8827 | 0.4354 | 0.5831 | 0.6782 |
| 4 | RandomForest | 0.7614 | 0.8660 | 0.3895 | 0.2005 | 0.5848 | 0.6752 | 0.8319 | 0.6439 | 0.7259 | 0.5313 |
Confusion — HistGB (US) @ thr=0.585
Confusion — Bagging (US) @ thr=0.634
Ranking rule: HistGB 0.7765 ROC-AUC, Bagging 0.7705; essentially the same as the original runs (HistGB 0.7790, Logit 0.7630, RF 0.7622), so no lift.
At the policy cutoff (precision ≥ 0.80 set on TRAIN):
- Bagging (undersampled): precision 0.8826, recall 0.4727; recall still collapses.
- RandomForest (undersampled): precision 0.8319, recall 0.6439, F1 0.7259; more balanced than Bagging-US, but still no better than the original RF.
Bottom line: undersampling discards training data and does not help here. Stick with the original class-weight + calibrated Logit as the reference, and treat OS/US as negative controls.
# Visualize UNDERSAMPLED TRAIN
for name, pipe in MODEL_ZOO_US.items():
# Fit the pipeline on undersampled training data before accessing feature names
pipe.fit(DF_TR_US, y_tr_us)
model = pipe.named_steps["model"]
prep = pipe.named_steps["prep"]
feats = prep.get_feature_names_out()
print(f"\n=== {name} (Undersampled) ===")
# For ensemble models, visualize the first tree if possible
tree = None
if hasattr(model, "estimators_"):
tree = model.estimators_[0]
elif isinstance(model, DecisionTreeClassifier):
tree = model
# Only plot tree if it's a DecisionTreeClassifier
if tree is not None and isinstance(tree, DecisionTreeClassifier):
plt.figure(figsize=(18, 10))
plot_tree(
tree,
feature_names=feats,
class_names=["Denied", "Certified"],
filled=True, rounded=True, proportion=True, fontsize=7,
max_depth=3
)
plt.title(f"{name} (Undersampled) — Tree #0 (max_depth=3)")
plt.tight_layout()
plt.show()
# Print tree rules (text format, depth=3)
print(f"=== {name} (Undersampled) — Rules for Tree #0 (max_depth=3) ===")
print(export_text(tree, feature_names=list(feats), max_depth=3))
else:
print(f"[SKIP] Tree visualization not supported for {type(model).__name__}")
# Feature importances (top 20)
if hasattr(model, "feature_importances_"):
importances = pd.Series(model.feature_importances_, index=feats)
top_imp = importances.sort_values(ascending=False).head(20)
plt.figure(figsize=(8, 6))
plt.barh(top_imp.index, top_imp.values)
plt.gca().invert_yaxis()
plt.title(f"{name} (Undersampled) — Top 20 Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
=== DecisionTree (Undersampled) ===
=== DecisionTree (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_Y <= 0.50
|   |   |--- _uow_Year <= 0.50
|   |   |   |--- _edu_Doctorate <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _edu_Doctorate > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _uow_Year > 0.50
|   |   |   |--- _continent_Europe <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Europe > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_Y > 0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's > 0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Hour > 0.50
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School > 0.50
|   |--- _continent_Asia <= 0.50
|   |   |--- _continent_Europe <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _job_exp_Y > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _continent_Europe > 0.50
|   |   |   |--- log_employees <= 7.27
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_employees > 7.27
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_Asia > 0.50
|   |   |--- _region_Northeast <= 0.50
|   |   |   |--- _region_West <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _region_West > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _region_Northeast > 0.50
|   |   |   |--- wage_usd_yr_cap <= 504597.28
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- wage_usd_yr_cap > 504597.28
|   |   |   |   |--- truncated branch of depth 3
=== RandomForest (Undersampled) ===
=== RandomForest (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 252093.80
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 30
|   |   |   |--- _job_exp_Y > 0.50
|   |   |   |   |--- truncated branch of depth 40
|   |   |--- _region_West > 0.50
|   |   |   |--- _edu_High School <= 0.50
|   |   |   |   |--- truncated branch of depth 28
|   |   |   |--- _edu_High School > 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |--- wage_usd_yr_cap > 252093.80
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 10
|   |   |   |--- _uow_Hour > 0.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y > 0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 12
|   |   |   |--- _uow_Hour > 0.50
|   |   |   |   |--- truncated branch of depth 15
|--- _edu_Doctorate > 0.50
|   |--- wage_usd_yr_cap <= 213888.59
|   |   |--- log_wage_usd_yr <= 10.36
|   |   |   |--- log_company_age <= 4.48
|   |   |   |   |--- truncated branch of depth 10
|   |   |   |--- log_company_age > 4.48
|   |   |   |   |--- class: 1.0
|   |   |--- log_wage_usd_yr > 10.36
|   |   |   |--- _region_Island <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _region_Island > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- wage_usd_yr_cap > 213888.59
|   |   |--- _region_Northeast <= 0.50
|   |   |   |--- company_age <= 9.50
|   |   |   |   |--- class: 1.0
|   |   |   |--- company_age > 9.50
|   |   |   |   |--- truncated branch of depth 7
|   |   |--- _region_Northeast > 0.50
|   |   |   |--- log_company_age <= 2.92
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_company_age > 2.92
|   |   |   |   |--- truncated branch of depth 4
=== Bagging (Undersampled) ===
=== Bagging (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_Y <= 0.50
|   |   |--- _uow_Hour <= 0.50
|   |   |   |--- _continent_Europe <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Europe > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _uow_Hour > 0.50
|   |   |   |--- log_employees <= 7.96
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_employees > 7.96
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_Y > 0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's > 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year > 0.50
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School > 0.50
|   |--- _continent_North America <= 0.50
|   |   |--- _region_Northeast <= 0.50
|   |   |   |--- _region_West <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _region_West > 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _region_Northeast > 0.50
|   |   |   |--- log_wage_usd_yr <= 8.67
|   |   |   |   |--- truncated branch of depth 2
|   |   |   |--- log_wage_usd_yr > 8.67
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_North America > 0.50
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- employees_clip0_cap <= 7673.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- employees_clip0_cap > 7673.50
|   |   |   |   |--- truncated branch of depth 2
|   |   |--- _job_exp_Y > 0.50
|   |   |   |--- log_wage_usd_yr <= 9.84
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_wage_usd_yr > 9.84
|   |   |   |   |--- truncated branch of depth 3
=== AdaBoost (Undersampled) ===
=== AdaBoost (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- class: 1
|--- _edu_High School > 0.50
|   |--- class: 0
=== HistGB (Undersampled) ===
[SKIP] Tree visualization not supported for HistGradientBoostingClassifier
Hyperparameter Tuning¶
Best practices for hyperparameter tuning in AdaBoost:
n_estimators:
- Start with a moderate number (50 is a common default) and increase in steps: 50, 75, 85, 100
- Use fewer estimators (e.g., 50 to 100) if using complex base learners (like deeper decision trees)
- Use more estimators (e.g., 100 to 150) when the learning rate is low (e.g., 0.1 or lower)
- Avoid very high values unless performance keeps improving on validation
learning_rate:
- Common values to try: 1.0, 0.5, 0.1, 0.01
- Use 1.0 for faster training, suitable for fewer estimators
- Use 0.1 or 0.01 with more estimators to improve generalization
- Avoid very small values (< 0.01) unless you plan to use many estimators (e.g., > 500) and have sufficient data
Best practices for hyperparameter tuning in Random Forest:
n_estimators:
- Start with a moderate number (50 is a common default) and increase in steps: 50, 75, 100, 125
- Higher values generally improve performance but increase training time
- Use 100-150 for large datasets or when variance is high
min_samples_leaf:
- Try values like: 1, 2, 4, 5, 10
- Higher values reduce model complexity and help prevent overfitting
- Use 1–2 for low-bias models, higher (like 5 or 10) for more regularized models
- Works well in noisy datasets to smooth predictions
max_features:
- Try values: `"sqrt"` (default for classification), `"log2"`, `None`, or float values (e.g., `0.3`, `0.5`)
- `"sqrt"` balances between diversity and performance for classification tasks
- Lower values (e.g., `0.3`) increase tree diversity, reducing overfitting
- Higher values (closer to `1.0`) may capture more interactions but risk overfitting
max_samples (for bootstrap sampling):
- Try float values between `0.5` and `1.0`, or fixed integers
- Use `0.6`–`0.9` to introduce randomness and reduce overfitting
- Smaller values increase diversity between trees, improving generalization
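A hedged sketch of such a Random Forest sweep with `RandomizedSearchCV` (synthetic data; ranges are illustrative only):

```python
# Illustrative Random Forest sweep over the ranges above (synthetic data).
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
rf = RandomForestClassifier(random_state=0, n_jobs=-1)
dists = {
    "n_estimators": [50, 75, 100, 125],
    "min_samples_leaf": randint(1, 11),      # samples integers 1..10
    "max_features": ["sqrt", "log2", 0.3, 0.5],
    "max_samples": [0.6, 0.75, 0.9, None],   # None = full bootstrap sample
}
search = RandomizedSearchCV(rf, dists, n_iter=10, scoring="roc_auc",
                            cv=3, random_state=0, n_jobs=-1).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```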
Best practices for hyperparameter tuning in Gradient Boosting:
n_estimators:
- Start with 100 (default) and increase: 100, 200, 300, 500
- Typically, higher values lead to better performance, but they also increase training time
- Use 200–500 for larger datasets or complex problems
- Monitor validation performance to avoid overfitting, as too many estimators can degrade generalization
learning_rate:
- Common values to try: 0.1, 0.05, 0.01, 0.005
- Use lower values (e.g., 0.01 or 0.005) if you are using many estimators (e.g., > 200)
- Higher learning rates (e.g., 0.1) can be used with fewer estimators for faster convergence
- Always balance the learning rate with `n_estimators` to prevent overfitting or underfitting
subsample:
- Common values: 0.7, 0.8, 0.9, 1.0
- Use a value between `0.7` and `0.9` for improved generalization by introducing randomness
- `1.0` uses the full dataset for each boosting round, potentially leading to overfitting
- Reducing `subsample` can help reduce overfitting, especially in smaller datasets
max_features:
- Common values: `"sqrt"`, `"log2"`, or float (e.g., `0.3`, `0.5`)
- `"sqrt"` (default) works well for classification tasks
- Lower values (e.g., `0.3`) help reduce overfitting by limiting the number of features considered at each split
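These ranges can be combined into a small grid (an illustrative sketch on synthetic data with scikit-learn's `GradientBoostingClassifier`, not the project pipeline):

```python
# Illustrative Gradient Boosting grid over the ranges above (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
gb = GradientBoostingClassifier(random_state=0)
grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.1, 0.05],
    "subsample": [0.8, 1.0],        # <1.0 = stochastic gradient boosting
    "max_features": ["sqrt", None],
}
search = GridSearchCV(gb, grid, scoring="roc_auc", cv=3, n_jobs=-1).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```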
Best practices for hyperparameter tuning in XGBoost:
n_estimators:
- Start with 50 and increase in steps: 50, 75, 100, 125
- Use more estimators (e.g., 150-250) when using lower learning rates
- Monitor validation performance
- High values improve learning but increase training time
subsample:
- Common values: 0.5, 0.7, 0.8, 1.0
- Use `0.7`–`0.9` to introduce randomness and reduce overfitting
- `1.0` uses the full dataset in each boosting round; may overfit on small datasets
- Values < 0.5 are rarely useful unless the dataset is very large
gamma:
- Try values: 0 (default), 1, 3, 5, 8
- Controls minimum loss reduction needed for a split
- Higher values make the algorithm more conservative (i.e., fewer splits)
- Use values > 0 to regularize and reduce overfitting, especially on noisy data
colsample_bytree:
- Try values: 0.3, 0.5, 0.7, 1.0
- Fraction of features sampled per tree
- Lower values (e.g., 0.3 or 0.5) increase randomness and improve generalization
- Use `1.0` when you want all features considered for every tree
colsample_bylevel:
- Try values: 0.3, 0.5, 0.7, 1.0
- Fraction of features sampled at each tree level (i.e., per split depth)
- Lower values help in regularization and reducing overfitting
- Often used in combination with `colsample_bytree` for fine control over feature sampling
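A sketch of the corresponding XGBoost search (the grid mirrors the ranges above; `xgboost` is an optional dependency here, and the fit is skipped when it is not installed):

```python
# Grid mirroring the XGBoost ranges above; the search runs only when the
# optional xgboost package is importable, otherwise the grid is just printed.
param_grid = {
    "n_estimators": [50, 75, 100, 125],
    "subsample": [0.5, 0.7, 0.8, 1.0],
    "gamma": [0, 1, 3, 5],
    "colsample_bytree": [0.3, 0.5, 0.7, 1.0],
    "colsample_bylevel": [0.3, 0.5, 0.7, 1.0],
}
try:
    from xgboost import XGBClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    xgb = XGBClassifier(eval_metric="logloss", random_state=0)
    search = RandomizedSearchCV(xgb, param_grid, n_iter=10, scoring="roc_auc",
                                cv=3, random_state=0, n_jobs=-1).fit(X, y)
    print(search.best_params_)
except ImportError:
    print("xgboost not installed; grid shown for reference only")
```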
# Hyperparameter Tuning — Top 3 models (tune once), calibrate, evaluate, visualize
assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals()
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()
import warnings
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
y_tr = DF_TR["y"].astype(int).to_numpy()
y_ho = DF_HO["y"].astype(int).to_numpy()
# ---------- Helpers ----------
def pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
prec, rec, thr = precision_recall_curve(y_true, y_prob)
ok = np.where(prec[1:] >= p_floor)[0]
if ok.size:
cands = thr[ok]
f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
return float(cands[int(np.argmax(f1s))])
fpr, tpr, thr = roc_curve(y_true, y_prob)
return float(thr[int(np.argmax(tpr - fpr))])
def eval_at(y_true, y_prob, thr):
return model_performance_classification_sklearn(y_true, y_prob, threshold=thr)[0]
# Create OS/US TRAIN (if not already)
def _make_os(df):
pos, neg = df[df["y"]==1], df[df["y"]==0]
minority, majority = (neg, pos) if len(pos)>=len(neg) else (pos, neg)
mi = minority.sample(n=len(majority), replace=True, random_state=RNG)
return pd.concat([majority, mi], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)
def _make_us(df):
pos, neg = df[df["y"]==1], df[df["y"]==0]
majority, minority = (pos, neg) if len(pos)>=len(neg) else (neg, pos)
maj = majority.sample(n=len(minority), replace=False, random_state=RNG)
return pd.concat([maj, minority], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)
DF_TR_OS = globals().get("DF_TR_OS", _make_os(DF_TR))
DF_TR_US = globals().get("DF_TR_US", _make_us(DF_TR))
# Tree-style prep (NUM passthrough + OHE CAT)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR.columns]
pp_trees = ColumnTransformer([("num","passthrough",NUM),
("cat",OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
remainder="drop", verbose_feature_names_out=False)
# ---------- Choose your top 3 (adjust if desired) ----------
# ('model_name', 'setting'): setting ∈ {'orig','os','us'}
CANDIDATES = [
("HistGB", "orig"), # HistGB on Original
("HistGB", "us"), # HistGB on Undersampled
("Bagging", "os"), # Bagging on Oversampled
]
def build_pipeline(name, setting):
# class_weight only on original to respect natural skew
cw_tree = "balanced" if setting=="orig" else None
cw_rf = "balanced_subsample" if setting=="orig" else None
if name=="HistGB":
est = HistGradientBoostingClassifier(random_state=RNG, class_weight=cw_tree)
elif name=="Bagging":
est = BaggingClassifier(
estimator=DecisionTreeClassifier(max_depth=6, class_weight=cw_tree, random_state=RNG),
n_estimators=200, n_jobs=-1, random_state=RNG
)
elif name=="RandomForest":
est = RandomForestClassifier(n_estimators=600, random_state=RNG, n_jobs=-1, class_weight=cw_rf)
elif name=="DecisionTree":
est = DecisionTreeClassifier(max_depth=6, class_weight=cw_tree, random_state=RNG)
elif name=="AdaBoost":
est = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, class_weight=cw_tree, random_state=RNG),
random_state=RNG)
else:
raise ValueError(name)
return Pipeline([("prep", pp_trees), ("model", est)])
# Small, sane grids (one pass)
GRIDS = {
"HistGB": {
"model__learning_rate":[0.03,0.06,0.1],
"model__max_depth":[None,6,10],
"model__l2_regularization":[0.0,0.1,0.5],
"model__max_iter":[400,600,800],
},
"Bagging": {
"model__n_estimators":[150,250,350],
    # the base tree depth can also be tuned via the nested estimator, e.g.:
    # "model__estimator__max_depth": [4, 6, 8],
},
"RandomForest": {
"model__n_estimators":[400,600,800],
"model__max_depth":[None,12,18],
"model__min_samples_leaf":[1,3,10],
},
"DecisionTree": {
"model__max_depth":[3,6,9,12],
"model__min_samples_leaf":[1,5,20],
},
"AdaBoost": {
"model__n_estimators":[50,75,100,150],
"model__learning_rate":[1.0,0.5,0.1],
},
}
def get_train(sett):
if sett=="orig": return DF_TR, y_tr
if sett=="os": return DF_TR_OS, DF_TR_OS["y"].astype(int).to_numpy()
if sett=="us": return DF_TR_US, DF_TR_US["y"].astype(int).to_numpy()
raise ValueError(sett)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RNG)
rows = []
viz_conf = [] # store (label, probs, thr) for confusion plots
for name, setting in CANDIDATES:
Xtr, ytr = get_train(setting)
pipe = build_pipeline(name, setting)
grid = GRIDS[name]
# Randomized search for larger grids, else grid search
n_cands = np.prod([len(v) for v in grid.values()])
search = (RandomizedSearchCV(pipe, grid, n_iter=min(20, n_cands), scoring="roc_auc",
cv=cv, n_jobs=-1, random_state=RNG, refit=True, verbose=0)
if n_cands > 30 else
GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv, n_jobs=-1, refit=True, verbose=0))
# Fit
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=FutureWarning)
search.fit(Xtr, ytr)
# Calibrate on ORIGINAL TRAIN for consistent policy & thresholding
cal = CalibratedClassifierCV(search.best_estimator_, method="sigmoid", cv=5).fit(DF_TR, y_tr)
thr = pick_thr_precision_floor(y_tr, cal.predict_proba(DF_TR)[:,1], p_floor=0.80)
proba_ho = cal.predict_proba(DF_HO)[:,1]
perf = eval_at(y_ho, proba_ho, thr)
perf.update(model=name, setting=setting, best_params=search.best_params_)
rows.append(perf)
viz_conf.append((f"{name} ({setting})", proba_ho, thr))
TUNED3 = (pd.DataFrame(rows)
.loc[:, ["model","setting","best_params","roc_auc","pr_auc","accuracy","precision","recall","f1","brier","log_loss","threshold"]]
.sort_values(["roc_auc","pr_auc"], ascending=False)
.reset_index(drop=True).round(4))
print("=== Tuned top-3 — calibrated, policy thresholded (HOLD-OUT) ===")
display(TUNED3)
# Confusion matrices for each tuned model
for lbl, ph, thr in viz_conf:
print(f"\nConfusion — {lbl} @ thr={thr:.3f}")
confusion_matrix_sklearn(y_ho, ph, threshold=thr, labels=("Denied","Certified"))
=== Tuned top-3 — calibrated, policy thresholded (HOLD-OUT) ===
|   | model | setting | best_params | roc_auc | pr_auc | accuracy | precision | recall | f1 | brier | log_loss | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HistGB | orig | {'model__max_iter': 400, 'model__max_depth': 1... | 0.7792 | 0.8751 | 0.7396 | 0.7913 | 0.8287 | 0.8096 | 0.1739 | 0.5167 | 0.5730 |
| 1 | HistGB | us | {'model__max_iter': 600, 'model__max_depth': 1... | 0.7788 | 0.8749 | 0.7402 | 0.7900 | 0.8323 | 0.8106 | 0.1738 | 0.5168 | 0.5683 |
| 2 | Bagging | os | {'model__n_estimators': 350} | 0.7717 | 0.8697 | 0.7392 | 0.7909 | 0.8287 | 0.8094 | 0.1785 | 0.5331 | 0.5447 |
Confusion — HistGB (orig) @ thr=0.573
Confusion — HistGB (us) @ thr=0.568
Confusion — Bagging (os) @ thr=0.545
Model Performance Summary and Final Model Selection¶
Selected HistGradientBoosting (Original, tuned) as the final model, as it achieved:
- ROC-AUC = 0.7792
- PR-AUC = 0.8751 on the hold-out
- Precision = 0.791
- Recall = 0.829
- F1 = 0.810
- Threshold = 0.573, calibrated, with the threshold set on TRAIN under the precision ≥ 0.80 policy
The tuned US and OS variants were close but did not exceed this AUC; we keep the calibrated probabilities and the precision-floor policy for deployment.
Actionable Insights and Recommendations¶
Recommendation: prioritize applicants with higher education (Master's or Doctorate degrees), a yearly unit of wage, and strong company profiles (larger, older employers); these are the factors that most drive certification odds.
Power Ahead